Intermediacy of publications

Citation networks of scientific publications offer fundamental insights into the structure and development of scientific knowledge. We propose a new measure, called intermediacy, for tracing the historical development of scientific knowledge. Given two publications, an older and a more recent one, intermediacy identifies publications that seem to play a major role in the historical development from the older to the more recent publication. The identified publications are important in connecting the older and the more recent publication in the citation network. After providing a formal definition of intermediacy, we study its mathematical properties. We then present two empirical case studies, one tracing historical developments at the interface between the community detection literature and the scientometric literature and one examining the development of the literature on peer review. We show both conceptually and empirically how intermediacy differs from main path analysis, which is the most popular approach for tracing historical developments in citation networks. Main path analysis tends to favor longer paths over shorter ones, whereas intermediacy has the opposite tendency. We conclude that, compared to main path analysis, intermediacy offers a more principled approach for tracing the historical development of scientific knowledge.

This manuscript was compiled on December 21, 2018.
intermediacy | publication | citation network | main path analysis

Citation networks provide invaluable information for tracing historical developments in science. The idea of tracing scientific developments based on citation data goes back to Eugene Garfield, the founder of the Science Citation Index. In a report published more than 50 years ago, Garfield and his co-workers concluded that citation analysis is "a valid and valuable means of creating accurate historical descriptions of scientific fields" (1). Garfield also developed a software tool called HistCite that visualizes citation networks of scientific publications. This tool supports users in tracing historical developments in science, a process that Garfield sometimes referred to as algorithmic historiography (2–4). More recently, a software tool called CitNetExplorer (5) was developed that has similar functionality but offers more flexibility in analyzing large-scale citation networks. Other software tools, most notably CiteSpace (6) and CRExplorer (7, 8), provide alternative approaches for tracing scientific developments based on citation data.
Main path analysis, originally proposed by Hummon and Doreian (9), is a widely used technique for tracing historical developments in science. Given a citation network, main path analysis identifies one or more paths in the network that are considered to represent the most important scientific developments. Many variants and extensions of main path analysis have been proposed (10–16), not only for citation networks of scientific publications but also for patent citation networks (17–21).
In this paper, we introduce a new approach for tracing historical developments in science based on citation networks. We propose a measure called intermediacy. Given two publications dealing with a specific research topic, an older publication and a more recent one, intermediacy can be used to identify publications that appear to play a major role in the historical development from the older to the more recent publication. These are publications that, based on citation links, are important in connecting the older and the more recent publication.
Like main path analysis, intermediacy can be used to identify one or more citation paths between two publications. However, as we will make clear, there are fundamental differences between intermediacy and main path analysis. Most significantly, we will show that main path analysis tends to favor longer citation paths over shorter ones, whereas intermediacy has the opposite tendency. For the purpose of tracing historical developments in science, we argue that intermediacy yields better results than main path analysis.

Intermediacy
Consider a directed acyclic graph G = (V, E), where V denotes the set of nodes of G and E denotes the set of edges of G. The edges are directed. We are interested in the connectivity between a source s ∈ V and a target t ∈ V. Only nodes that are located on a path from source s to target t are of relevance. We refer to such a path as a source-target path. We assume that each node v ∈ V is located on a source-target path.

Definition 1. Given a source s and a target t, a path from s to t is called a source-target path.

Significance Statement
Researchers spend a lot of time keeping track of the literature in their field. Computational methods can be used to increase the efficiency with which researchers study the literature. We propose a method called intermediacy that enables tracing the historical development of scientific knowledge. Based on citation relations, intermediacy aims to identify publications that play a major role in the historical development from an older publication to a more recent one. Main path analysis currently is the most commonly used approach for addressing this problem. We show the advantages of intermediacy over main path analysis. When implemented in interactive search interfaces, intermediacy may help to significantly increase the efficiency with which researchers study the literature in their field.

Fig. 1. (A) For p → 0, intermediacy favors nodes located on shorter paths and therefore node u has a higher intermediacy than node v. For p → 1, intermediacy favors nodes located on a larger number of edge independent paths and therefore node v has a higher intermediacy than node u.
(B) Illustration of the choice of the parameter p. Nodes u and v are connected by a single direct path in the left graph and by k indirect paths of length 2 in the right graph. For different values of k, the bar chart shows the values of p for which the probability that there is an active path from node u to node v is higher (in orange) or lower (in gray) in the left graph than in the right graph.

In this paper, our focus is on citation networks of scientific publications. In this context, nodes are publications and edges are citations. We choose edges to be directed from a citing publication to a cited publication. Hence, edges point backward in time. This means that the source is a more recent publication and the target an older one. Informally, the more important the role of a node v ∈ V in connecting source s to target t, the higher the intermediacy of v. To formally define intermediacy, we assume that each edge e ∈ E is active with a certain probability p. We assume that the probability of being active is the same for all edges e ∈ E. Based on the idea of active and inactive edges, we introduce the following definitions.

Definition 2.
If all edges on a path are active, the path is called active. Otherwise the path is called inactive. If a node v ∈ V is located on an active source-target path, the node is called active. Otherwise the node is called inactive.
For two nodes u, v ∈ V, we use Xuv to indicate whether there is an active path (or multiple active paths) from node u to node v (Xuv = 1) or not (Xuv = 0). The probability that there is an active path from node u to node v is denoted by Pr(Xuv = 1). We use Xst(v) to indicate whether there is an active source-target path that goes through node v (Xst(v) = 1) or not (Xst(v) = 0). The probability that there is an active source-target path that goes through node v is denoted by Pr(Xst(v) = 1) = Pr(Xsv = 1) Pr(Xvt = 1). The factorization holds because, in an acyclic graph, no edge can lie both on a path from s to v and on a path from v to t, so the two events depend on disjoint sets of edges and are independent. This probability equals the probability that node v is active.
Intermediacy can now be defined as follows.

Definition 3.
The intermediacy φv of a node v ∈ V is the probability that v is active, that is,

φv = Pr(Xst(v) = 1) = Pr(Xsv = 1) Pr(Xvt = 1). [1]

In the interpretation of intermediacy, we focus on the ranking of nodes relative to each other. We do not consider the absolute values of intermediacy. For instance, suppose the intermediacy of node v ∈ V is twice as high as the intermediacy of node u ∈ V. We then consider node v to be more important than node u in connecting the source s and the target t. However, we do not consider node v to be twice as important as node u.
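To make the definition concrete, the following is a minimal Monte Carlo sketch that estimates the intermediacy of every node by repeatedly sampling which edges are active. It is an illustration under our own assumptions and not the algorithm from the Materials and Methods section. The function names, the toy edge list, and the default sample size are hypothetical.

```python
import random
from collections import defaultdict


def reachable(start, adjacency):
    """Return the set of nodes reachable from start via the adjacency lists."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def estimate_intermediacy(edges, source, target, p, n_samples=20000, seed=0):
    """Estimate Pr(node is active) for each node by sampling edge states."""
    rng = random.Random(seed)
    nodes = {x for edge in edges for x in edge}
    counts = dict.fromkeys(nodes, 0)
    for _ in range(n_samples):
        forward = defaultdict(list)   # active edges, in citation direction
        backward = defaultdict(list)  # the same active edges, reversed
        for u, v in edges:
            if rng.random() < p:      # each edge is active with probability p
                forward[u].append(v)
                backward[v].append(u)
        from_source = reachable(source, forward)
        if target not in from_source:
            continue                  # no active source-target path at all
        to_target = reachable(target, backward)
        # A node is active if it lies on an active source-target path.
        for node in from_source & to_target:
            counts[node] += 1
    return {node: count / n_samples for node, count in counts.items()}


# Hypothetical toy graph: one path of length 2 (via a), one of length 3 (via b, c).
toy_edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "c"), ("c", "t")]
estimates = estimate_intermediacy(toy_edges, "s", "t", p=0.5)
```

In this toy graph the exact values are p^2 for node a and p^3 for nodes b and c, so the estimates can be checked against 0.25 and 0.125 for p = 0.5.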
We now present an analysis of the mathematical properties of intermediacy. The proofs of the mathematical results provided below can be found in the Materials and Methods section.

Limit behavior.
To get a better understanding of intermediacy, we study the behavior of intermediacy in two limit cases, namely the case in which the probability p that an edge is active goes to 0 and the case in which the probability p goes to 1. In each of the two cases, the ranking of the nodes in a graph based on intermediacy turns out to have a natural interpretation. The difference between the two cases is illustrated in Fig. 1A.
Let ℓv denote the length of the shortest source-target path going through node v ∈ V. The following theorem states that in the limit as the probability p that an edge is active tends to 0, the ranking of nodes based on intermediacy coincides with the ranking based on ℓv. Nodes located on shorter source-target paths are more intermediate than nodes located on longer source-target paths.

Theorem 1. In the limit as the probability p tends to 0, ℓu < ℓv implies φu > φv.

The intuition underlying this theorem is as follows. When the probability that an edge is active is close to 0, almost all edges are inactive. Consequently, almost all source-target paths are inactive as well. However, from a relative point of view, longer source-target paths are more likely to be inactive than shorter source-target paths. This means that nodes located on shorter source-target paths are more likely to be active than nodes located on longer source-target paths (even though for all nodes the probability of being active is close to 0). Nodes located on shorter source-target paths therefore have a higher intermediacy than nodes located on longer source-target paths.
We now consider the limit case in which the probability p that an edge is active goes to 1. Let σv denote the number of edge independent source-target paths going through node v ∈ V. Theorem 2 states that in the limit as p tends to 1, the ranking of nodes based on intermediacy coincides with the ranking based on σv. The larger the number of edge independent source-target paths going through a node, the higher the intermediacy of the node.

Theorem 2. In the limit as the probability p tends to 1, σu > σv implies φu > φv.
Intuitively, this theorem can be understood as follows. When the probability that an edge is active is close to 1, almost all edges are active. Consequently, almost all source-target paths are active as well, and so are almost all nodes. A node is inactive only if all source-target paths going through the node are inactive. If there are σ edge independent source-target paths that go through a node, this means that the node can be inactive only if there are at least σ inactive edges. Consider two nodes u, v ∈ V. Suppose that the number of edge independent source-target paths going through node v is larger than the number of edge independent source-target paths going through node u. In order to be inactive, node v then requires more inactive edges than node u. This means that node v is less likely to be inactive than node u (even though for both nodes the probability of being inactive is close to 0). Hence, node v has a higher intermediacy than node u. More generally, nodes located on a larger number of edge independent source-target paths have a higher intermediacy than nodes located on a smaller number of edge independent source-target paths.
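The two limit cases can be checked numerically on a small graph. The sketch below computes intermediacy exactly by enumerating all edge states, which is feasible only for very small graphs. The graph is a hypothetical example, not the one in Fig. 1A: node u lies on a single short path (ℓu = 2, σu = 1), while node v lies on longer paths but on two edge independent ones (ℓv = 4, σv = 2). All names are our own.

```python
from itertools import product


def closure(start, active_edges, forward=True):
    """Nodes reachable from start along active edges (or against them)."""
    seen = {start}
    changed = True
    while changed:
        changed = False
        for a, b in active_edges:
            tail, head = (a, b) if forward else (b, a)
            if tail in seen and head not in seen:
                seen.add(head)
                changed = True
    return seen


def exact_intermediacy(edges, source, target, p):
    """Exact intermediacy by summing over all 2^m edge states (small m only)."""
    nodes = sorted({x for edge in edges for x in edge})
    phi = dict.fromkeys(nodes, 0.0)
    m = len(edges)
    for state in product((False, True), repeat=m):
        active = [edge for edge, on in zip(edges, state) if on]
        weight = p ** len(active) * (1 - p) ** (m - len(active))
        from_source = closure(source, active, forward=True)
        to_target = closure(target, active, forward=False)
        # Nodes in both sets lie on an active source-target path.
        for node in from_source & to_target:
            phi[node] += weight
    return phi


# Hypothetical graph: short path via u, two edge independent long paths via v.
limit_edges = [
    ("s", "u"), ("u", "t"),
    ("s", "a"), ("a", "v"), ("s", "b"), ("b", "v"),
    ("v", "c"), ("c", "t"), ("v", "d"), ("d", "t"),
]
low = exact_intermediacy(limit_edges, "s", "t", p=0.1)
high = exact_intermediacy(limit_edges, "s", "t", p=0.9)
```

For this graph the closed forms are φu = p^2 and φv = (1 − (1 − p^2)^2)^2, so node u ranks higher at p = 0.1 and node v ranks higher at p = 0.9, in line with Theorems 1 and 2.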
Parameter choice. The probability p that an edge is active is a free parameter of intermediacy for which one needs to choose an appropriate value. The results presented above are concerned with the behavior of intermediacy in the limit cases in which the probability p tends to either 0 or 1. Fig. 1B provides some insight into the behavior of intermediacy for values of the probability p that are in between these two extremes. The figure shows two graphs. In the left graph, there is a direct path (i.e., a path of length 1) from node u to node v. There are no indirect paths. In this graph, the probability that there is an active path from node u to node v equals p. In the right graph, there is no direct path from node u to node v, but there are k indirect paths of length 2. Each of these paths has a probability of p^2 of being active. Consequently, the probability that there is at least one active path from node u to node v equals 1 − (1 − p^2)^k. The bar chart in Fig. 1B shows for different values of k the values of p for which the probability that there is an active path from node u to node v is higher (in orange) or lower (in gray) in the left graph than in the right graph. For instance, suppose that k = 5. For p < 0.22, the probability that there is an active path from node u to node v is higher in the left graph than in the right graph. For p > 0.22, the situation is the other way around. If the probability p that an edge is active is set to 0.22, a direct path between two nodes is considered equally strong as 5 indirect paths of length 2. Based on Fig. 1B, one can set the probability p to a value that one considers appropriate for a particular analysis.
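The break-even value of p in the comparison above is the interior solution of p = 1 − (1 − p^2)^k. A minimal sketch of how it could be found by bisection is given below. The function name, bracket, and tolerance are our own choices, and the sketch assumes k ≥ 2, since for k = 1 the direct path is stronger for every p between 0 and 1.

```python
def break_even_p(k, tol=1e-10):
    """Solve p = 1 - (1 - p^2)^k for the root strictly between 0 and 1."""
    if k < 2:
        raise ValueError("an interior break-even point exists only for k >= 2")

    def gap(p):
        # Positive when k indirect paths of length 2 beat one direct path.
        return 1 - (1 - p * p) ** k - p

    lo, hi = 1e-9, 0.999  # gap(lo) < 0 and gap(hi) > 0 for every k >= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Break-even probabilities for a few values of k.
thresholds = {k: break_even_p(k) for k in (2, 3, 5, 10)}
```

For k = 5 this gives a value close to the 0.22 reported above. For k = 2 the equation reduces to p^2 + p − 1 = 0, so the break-even point is (√5 − 1)/2 ≈ 0.618.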
Path addition and contraction. Next, we study two additional properties of intermediacy, the property of path addition and the property of path contraction. We show that both adding paths and contracting paths lead to an increase in intermediacy. Path addition and path contraction are important properties because they reflect the basic intuition underlying the idea of intermediacy.
We start by considering the property of path addition. We define path addition as follows.

Definition 4. Consider a directed acyclic graph G = (V, E) and two nodes u, v ∈ V such that there does not exist a path from node v to node u. Path addition is the operation in which a new path from node u to node v is added. Let ℓ denote the length of the new path. If ℓ = 1, an edge (u, v) is added. If ℓ > 1, ℓ − 1 new intermediate nodes are added, together with the ℓ edges that connect node u to node v through these nodes.

This definition includes the condition that there does not exist a path from node v to node u. This condition ensures that the graph G will remain acyclic after adding a path. The following theorem states that adding a path increases intermediacy.
Theorem 3. Consider a directed acyclic graph G = (V , E), a source s ∈ V , and a target t ∈ V . In addition, consider two nodes u, v ∈ V such that there does not exist a path from node v to node u. Adding a path from node u to node v increases the intermediacy φw of any node w ∈ V located on a path from source s to node u or from node v to target t.
Theorem 3 does not depend on the probability p. Adding a path always increases intermediacy, regardless of the value of p. To illustrate the theorem, consider Fig. 2A and Fig. 2B. The graph in Fig. 2B is identical to the one in Fig. 2A except that a path from node u to node v has been added. As can be seen, adding this path has increased the intermediacy of nodes located between source s and node u or between node v and target t, including nodes u and v themselves. While the intermediacy of other nodes has not changed, the intermediacy of these nodes has increased from 0.17 to 0.23. This reflects the basic intuition that, after a path from node u to node v has been added, going from source s to target t through nodes u and v has become 'easier' than it was before. This means that nodes located between source s and node u or between node v and target t have become more important in connecting the source and the target. Consequently, the intermediacy of these nodes has increased.
We now consider the property of path contraction. We use Vuv to denote the set of all nodes located on a path from node u to node v, including nodes u and v themselves. Path contraction is then defined as follows.

Definition 5. Consider a directed acyclic graph G = (V, E) and two nodes u, v ∈ V such that there exists at least one path from node u to node v. Path contraction is the operation in which all nodes in Vuv are contracted. This means that the nodes in Vuv are replaced by a new node r. Edges pointing from a node w ∉ Vuv to nodes in Vuv are replaced by a single new edge (w, r). Edges pointing from nodes in Vuv to a node w ∉ Vuv are replaced by a single new edge (r, w). Edges between nodes in Vuv are removed.
The following theorem states that contracting paths increases intermediacy.
Theorem 4. Consider a directed acyclic graph G = (V , E), a source s ∈ V , and a target t ∈ V . In addition, consider two nodes u, v ∈ V such that there exists at least one path from node u to node v and such that nodes in Vuv do not have neighbors outside Vuv except for incoming neighbors of node u and outgoing neighbors of node v. Contracting paths from node u to node v increases the intermediacy φw of any node w ∈ V located on a path from source s to node u or from node v to target t.
Like Theorem 3, Theorem 4 does not depend on the probability p. Theorem 4 is illustrated in Fig. 2B and Fig. 2C. The graph in Fig. 2C is identical to the one in Fig. 2B except that paths from node u to node v have been contracted. As a result, there has been an increase in the intermediacy of nodes located between source s and node u or between node v and target t, including nodes u and v themselves (which have been contracted into a new node r). While the intermediacy of other nodes has not changed, the intermediacy of these nodes has increased from 0.23 to 0.34. This reflects the basic intuition that, after paths from node u to node v have been contracted, going from source s to target t through nodes u and v has become 'easier' than it was before. In other words, nodes located on a path from source s to target t going through nodes u and v have become more important in connecting the source and the target, and hence the intermediacy of these nodes has increased.
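Both properties can be verified on a small example. The sketch below uses a hypothetical graph, not the one in Fig. 2, and computes intermediacy exactly by brute-force enumeration of edge states. It starts from a base graph, adds an edge from u to v, and then contracts Vuv = {u, a, v} into a node r. The graphs and function names are our own assumptions, and the contracted graph is written out by hand rather than derived programmatically.

```python
from itertools import product


def active_nodes(active_edges, source, target):
    """Nodes lying on a source-target path that uses only active edges."""
    def grow(start, flip):
        seen, changed = {start}, True
        while changed:
            changed = False
            for a, b in active_edges:
                tail, head = (b, a) if flip else (a, b)
                if tail in seen and head not in seen:
                    seen.add(head)
                    changed = True
        return seen
    return grow(source, flip=False) & grow(target, flip=True)


def intermediacy(edges, p, source="s", target="t"):
    """Exact intermediacy via enumeration of all edge states (tiny graphs only)."""
    phi = {x: 0.0 for edge in edges for x in edge}
    m = len(edges)
    for state in product((0, 1), repeat=m):
        active = [edge for edge, on in zip(edges, state) if on]
        weight = p ** len(active) * (1 - p) ** (m - len(active))
        for node in active_nodes(active, source, target):
            phi[node] += weight
    return phi


# Base graph: a long path s-u-a-v-t and a short path s-w-t.
base = [("s", "u"), ("u", "a"), ("a", "v"), ("v", "t"), ("s", "w"), ("w", "t")]
# Path addition with a new path of length 1 from u to v.
added = base + [("u", "v")]
# Path contraction of {u, a, v} into r, written out by hand.
contracted = [("s", "r"), ("r", "t"), ("s", "w"), ("w", "t")]

p = 0.5
phi_base = intermediacy(base, p)
phi_added = intermediacy(added, p)
phi_contracted = intermediacy(contracted, p)
```

With p = 0.5, the intermediacy of node u rises from 0.0625 to 0.15625 after path addition, and the contracted node r reaches 0.25. Node w, which is not located between s and u or between v and t, stays at 0.25 throughout.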
Alternative approaches. How does intermediacy differ from alternative approaches? We consider two alternative approaches. One is main path analysis (9). This is the most commonly used approach for tracing the historical development of scientific knowledge in citation networks. The other alternative approach is the expected path count approach. Like intermediacy, the expected path count approach distinguishes between active and inactive edges and focuses on active source-target paths. While intermediacy considers the probability that there is at least one active source-target path going through a node, the expected path count approach considers the expected number of active source-target paths that go through a node.
Consider the graph shown in Fig. 3A. To get from source s to target t, one could take either a path going through nodes u and v or the path going through node w. Based on intermediacy, the latter path represents a stronger connection between the source and the target than the former one. This follows from the path contraction property.
Interestingly, main path analysis gives the opposite result, as can be seen in Fig. 3B. For each edge, the figure shows the search path count, which is the number of source-target paths that go through the edge. There are two source-target paths that go through (s, u) and (v, t), while all other edges are included only in a single source-target path. Because the search path counts of (s, u) and (v, t) are higher than the search path counts of (s, w) and (w, t), main path analysis favors paths going through nodes u and v over the path going through node w. This is exactly opposite to the result obtained using intermediacy. Fig. 3B makes clear that main path analysis yields outcomes that violate the path contraction property. Main path analysis tends to favor longer paths over shorter ones. For the purpose of identifying publications that play an important role in connecting an older and a more recent publication, we consider this behavior to be undesirable. There are various variants of main path analysis, which all show the same type of undesirable behavior.
Instead of focusing on the probability of the existence of at least one active source-target path, as is done by intermediacy, one could also focus on the expected number of active source-target paths going through a node. This alternative approach, which we refer to as the expected path count approach, is illustrated in Fig. 3C. As can be seen in the figure, nodes u and v have a higher expected path count than node w. Paths going through nodes u and v may therefore be favored over the path going through node w. Fig. 3C shows that, unlike intermediacy, the expected path count approach does not have the path contraction property. Depending on the probability p, contracting paths may cause expected path counts to decrease rather than increase. Because the expected path count approach does not have the path contraction property, we do not consider this approach to be a suitable alternative to intermediacy.
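The contrast between the three approaches can be reproduced on a small graph. The exact graph in Fig. 3 is not reproduced here, so the sketch below assumes a hypothetical graph that matches the description: two source-target paths go through (s, u) and (v, t), and one path goes through node w. Search path counts are computed as the number of paths from the source to the tail of an edge times the number of paths from its head to the target. The closed-form intermediacy holds only for this particular graph. All names are our own.

```python
from functools import lru_cache

# Hypothetical graph: s-u, then two parallel routes u-a-v and u-b-v, then v-t,
# plus the short path s-w-t.
edges = [
    ("s", "u"), ("u", "a"), ("a", "v"), ("u", "b"), ("b", "v"), ("v", "t"),
    ("s", "w"), ("w", "t"),
]
succ, pred = {}, {}
for a, b in edges:
    succ.setdefault(a, []).append(b)
    pred.setdefault(b, []).append(a)


@lru_cache(maxsize=None)
def paths_from_source(node):
    """Number of distinct paths from s to node."""
    return 1 if node == "s" else sum(paths_from_source(x) for x in pred.get(node, []))


@lru_cache(maxsize=None)
def paths_to_target(node):
    """Number of distinct paths from node to t."""
    return 1 if node == "t" else sum(paths_to_target(x) for x in succ.get(node, []))


# Search path count: number of source-target paths through each edge.
spc = {(a, b): paths_from_source(a) * paths_to_target(b) for a, b in edges}


def discounted_paths(node, stop, neighbours, p):
    """Sum of p**length over all paths from node to stop along neighbours."""
    if node == stop:
        return 1.0
    return sum(p * discounted_paths(x, stop, neighbours, p)
               for x in neighbours.get(node, []))


def expected_path_count(node, p):
    """Expected number of active source-target paths through node."""
    return discounted_paths(node, "s", pred, p) * discounted_paths(node, "t", succ, p)


def intermediacy_closed_form(p):
    """Intermediacy of u and w, valid for this specific graph only."""
    u_to_v = 1 - (1 - p ** 2) ** 2  # at least one of the two u-v routes active
    return {"u": p * u_to_v * p, "w": p * p}
```

In this graph the search path counts of (s, u) and (v, t) are 2 and all other counts are 1. The expected path count of node u is 2p^4 against p^2 for node w, so node u overtakes node w once p exceeds about 0.71, whereas the intermediacy of node w is higher than that of node u for every p between 0 and 1.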

Empirical analysis
We now present two case studies that serve as empirical illustrations of the use of intermediacy. Case 1 deals with the topic of community detection and its relationship with scientometric research. This case was selected because we are well acquainted with the topic. Case 2 deals with the topic of peer review. This case is of interest because it was recently examined using main path analysis (22). Hence, it enables us to demonstrate the key differences between intermediacy and main path analysis. In both case studies, the intermediacy of publications was calculated using the Monte Carlo algorithm presented in the Materials and Methods section.

Case 1: Community detection and scientometrics.
We analyze how a method for community detection in networks ended up being used in the field of scientometrics to construct classification systems of scientific publications. In particular, we are interested in the development from Newman and Girvan (2004) to Klavans and Boyack (2017). These are our target and source publications, respectively. Newman and Girvan (2004) introduced a new measure for community detection in networks, known as modularity, while Klavans and Boyack (2017) compared different ways in which modularity-based approaches can be used to identify communities in citation networks.
Our analysis relies on data from the Scopus database produced by Elsevier. We also considered the Web of Science database produced by Clarivate Analytics. However, many citation links relevant for our analysis are missing in Web of Science. There are also missing citation links in Scopus, but for Scopus the problem is less significant than for Web of Science. We refer to Van Eck and Waltman (23) for a further discussion of the problem of missing citation links.
In the Scopus database, we found n = 64 223 publications that are located on a citation path between our source and target publications. In total, we identified m = 280 033 citation links between these publications. This means that on average each publication has k = 2m/n ≈ 8.72 citation links, counting both incoming and outgoing links. Fig. 4A shows how the probability of the existence of an active path between the source and target publications depends on the parameter p. This probability increases from zero for p = 0 to almost one starting from p = 0.25. The vertical line indicates the value p = 1/k. At this value, traditional percolation theory for random graphs suggests that the probability that the source and target publications are connected becomes non-negligible (24). When searching for a suitable value of p, the value p = 1/k suggested by percolation theory may serve as a reasonable starting point. In our case, this yields p ≈ 1/8.72 ≈ 0.11, resulting in a probability of about 0.40 for the existence of an active source-target path.
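The starting value for p follows from simple arithmetic on the network size. The short sketch below reproduces it using the counts reported above. The variable names are our own.

```python
# Counts reported for case 1 (Scopus data).
n_publications = 64_223   # publications on a citation path between source and target
n_citations = 280_033     # citation links between these publications

# Average number of citation links per publication, counting both
# incoming and outgoing links, so every link is counted twice.
mean_degree = 2 * n_citations / n_publications

# Starting value for p suggested by percolation theory for random graphs.
p_start = 1 / mean_degree

print(f"k = {mean_degree:.2f}, p = {p_start:.2f}")
```

This value is only a starting point. The analysis in this section eventually settles on a nearby value chosen with expert knowledge of the topic.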
For five different values of the parameter p, Fig. 4B shows the cumulative distribution of the intermediacy scores of our n = 64 223 publications. As is to be expected, when p is close to zero, intermediacy scores are extremely small. On the other hand, when p is getting close to one, intermediacy scores also approach one.
Based on our expert knowledge of the topic under study, we found that the most useful results were obtained by setting the parameter p equal to 0.1. Table 1 lists the ten publications with the highest intermediacy for p = 0.1. For each publication, the intermediacy is reported for five different values of p. In addition, the table also reports each publication's citation count and reference count. Fig. 4E shows the citation network of the ten most intermediate publications for p = 0.1.
We use our expert knowledge to interpret the results presented in Table 1 and Fig. 4E. The publications listed there include classical publications on community detection in general and modularity in particular. The publications by Newman all deal with modularity-based community detection. Rosvall and Bergstrom (2008) proposed an alternative approach to community detection. They applied their approach to a citation network of scientific journals, which explains the connection with the scientometric literature. Fortunato (2010) is a review of the literature on community detection. The intermediacy of this publication is probably strongly influenced by its large number of references. Hric et al. (2014) is a more recent publication on community detection. This publication focuses on the challenges of evaluating the results produced by community detection methods. This issue is very relevant in a scientometric context, and therefore the publication was cited by our source publication (Klavans & Boyack, 2017). Finally, there is one more scientometric publication in Table 1 and Fig. 4E. This publication (Ruiz-Castillo & Waltman, 2015) is one of the first studies presenting a scientometric application of classification systems of scientific publications constructed using a modularity-based approach. The publication was also cited by our source publication.
The citation counts reported in Table 1 show that some publications, especially the more recent ones, have a high intermediacy even though they have been cited only a very limited number of times. This makes clear that a ranking of publications based on intermediacy is quite different from a citation-based ranking of publications. The publications in Table 1 that have a high intermediacy and a small number of citations do have a substantial number of references.
Case 2: Peer review. We now turn to case 2, in which we analyze the literature on peer review. The analysis is based on data from the Web of Science database. We make use of the same data that was also used in a recent paper by Batagelj et al. (22).
We started with a citation network of 45 965 publications dealing with peer review. This is the citation network that was labeled CiteAcy by Batagelj et al. (22). We selected Cole and Cole (1967) as our target publication and Garcia et al. (2015) as our source publication.
As can be seen in Fig. 5A, percolation theory suggests a value of 1/k ≈ 1/11.12 ≈ 0.09 for the parameter p. This is close to the value of 0.11 obtained in case 1. However, the probability of the existence of an active path between the source and target publications equals 0.03, which is much lower than the probability of 0.40 in case 1. Intermediacy scores tend to be higher in case 2 than in case 1. This can be seen by comparing Fig. 5B to Fig. 4B. We note that the former figure has a linear horizontal axis, while the horizontal axis in the latter figure is logarithmic. The Spearman and Pearson correlations are somewhat higher in case 2 (Fig. 5C and Fig. 5D) than in case 1 (Fig. 4C and Fig. 4D). Table 2 lists the ten publications with the highest intermediacy, where we use a value of 0.1 for the parameter p, as in Table 1. Fig. 5E shows the citation network of the ten most intermediate publications. There are numerous paths in this citation network going from our source publication (Garcia et al., 2015) to our target publication (Cole & Cole, 1967). We regard these paths as the core paths between the source and target publications.
The core paths shown in Fig. 5E can be compared to the results obtained by Batagelj et al. (22) using main path analysis. Different variants of main path analysis were used by Batagelj et al. (22). Both using the original version of main path analysis (9) and using a more recent variant (12), the paths that were identified were rather lengthy, as can be seen in Figs. 9 and 10 in Batagelj et al. (22). The shortest main paths included about 20 publications. This confirms the fundamental difference between intermediacy and main path analysis. Main path analysis tends to favor longer paths over shorter ones, whereas intermediacy has the opposite tendency.
Using the results presented in Table 2 and Fig. 5E, experts on the topic of peer review could discuss the historical development of the literature on this topic. Since our own expertise on the topic of peer review is limited, we refrain from providing an interpretation of the results.

Discussion
Citation networks provide valuable information for tracing the historical development of scientific knowledge. For this purpose, citation networks are usually analyzed using main path analysis (9). However, the idea of a main path is relatively poorly understood. The algorithmic definition of a main path is clear, but the underlying conceptual motivation remains somewhat obscure. As we have shown in this paper, main path analysis has the tendency to favor longer paths over shorter ones. We consider this to be a counterintuitive property that lacks a convincing justification.
Intermediacy, introduced in this paper, offers an alternative to main path analysis. It provides a principled approach for identifying publications that appear to play a major role in the historical development from an older to a more recent publication. The older publication and the more recent one are referred to as the target and the source, respectively. Publications with a high intermediacy are important in connecting the source and the target publication in a citation network. As we have shown, intermediacy has two intuitively desirable properties, referred to as path addition and path contraction. Because of the path contraction property, intermediacy tends to favor shorter paths over longer ones. This is a fundamental difference with main path analysis. Intermediacy also has a free parameter that can be used to fine-tune its behavior. This parameter enables interpolation between two extremes. In one extreme, intermediacy identifies publications located on a shortest path between the source and the target publication. In the other extreme, it identifies publications located on the largest number of edge independent source-target paths.
We have also examined intermediacy in two case studies. In the first case study, intermediacy was used to trace historical developments at the interface between the community detection and the scientometric literature. This case study has shown that intermediacy yields results that appear sensible from the point of view of a domain expert. The second case study, in which intermediacy was applied to the literature on peer review, has provided an empirical illustration of the differences between intermediacy and main path analysis.
There are various directions for further research. First of all, a more extensive mathematical analysis of intermediacy can be carried out, possibly resulting in an axiomatic foundation for intermediacy. Intermediacy can also be generalized to weighted graphs. In a citation network, a citation link may for instance be weighted inversely proportional to the total number of incoming or outgoing citation links of a publication. Another way to generalize intermediacy is to allow for multiple sources and targets. The ideas underlying intermediacy may also be used to develop other types of indicators for graphs, such as an indicator of the connectedness of two nodes in a graph. In empirical analyses, intermediacy can be applied not only in citation networks of scientific publications, but for instance also in patent citation networks or in completely different types of networks, such as human mobility and migration networks, world trade networks, transportation networks, and passing networks in sports.

Materials and Methods
Proofs. Below we provide the proofs of the theorems presented in the main text. We first need to introduce some additional notation. We use Pr(Xuv) as a shorthand for Pr(Xuv = 1). To make explicit that this probability depends on a graph G, we write Pr(Xuv | G). Furthermore, we use Ae to indicate whether an edge e is active. Hence, Ae = 1 if edge e is active and Ae = 0 if edge e is not active.
Proof of Theorem 1. Let m = |E| denote the number of edges in the graph G. Suppose that the m edges are split into two sets, one set of M edges and another set of m − M edges. The probability that the edges in the former set are all active while the edges in the latter set are all inactive equals
\[ p^{M} (1 - p)^{m - M}. \]
Consider a node v ∈ V. The shortest source-target path that goes through node v has a length of ℓv. This means that at least ℓv edges need to be active in order to obtain an active source-target path that goes through node v. Hence, the probability that there is an active source-target path that goes through node v can be written as
\[ \phi_v = \sum_{i = \ell_v}^{m} n_{vi} \, p^{i} (1 - p)^{m - i}, \]
where nvi > 0 for all i = ℓv, . . . , m. Note that this probability equals the intermediacy of node v. Now consider two nodes u, v ∈ V with ℓu < ℓv. In the limit as p tends to 0, φu and φv both tend to 0. However, they do so at different rates. More specifically, in the limit as p tends to 0, we have
\[ \lim_{p \to 0} \frac{\phi_v}{\phi_u} = \lim_{p \to 0} \frac{n_{v \ell_v}}{n_{u \ell_u}} \, p^{\ell_v - \ell_u} = 0. \]
Hence, in the limit as p tends to 0, φu > φv.
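As an illustrative check of this limit, consider a small hypothetical graph (an assumption made here for illustration, not one of the graphs analyzed in the paper) that consists only of the two paths s → u → t and s → a → v → t. Writing intermediacy as the product Pr(Xsv) Pr(Xvt), the two intermediacies and their ratio can be worked out by hand:

```latex
\begin{align*}
\ell_u &= 2, & \phi_u &= \Pr(X_{su}) \Pr(X_{ut}) = p \cdot p = p^{2}, \\
\ell_v &= 3, & \phi_v &= \Pr(X_{sv}) \Pr(X_{vt}) = p^{2} \cdot p = p^{3}, \\
 & & \frac{\phi_v}{\phi_u} &= p \to 0 \quad \text{as } p \to 0.
\end{align*}
```

Both intermediacies vanish as p tends to 0, but φv vanishes faster, so node u, which lies on the shorter path, is ranked higher.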
Proof of Theorem 2. Let m = |E| denote the number of edges in the graph G, and let q denote the probability that an edge is inactive, that is, q = 1 − p. Suppose that the m edges are split into two sets, one set of M edges and another set of m − M edges. The probability that the edges in the former set are all inactive while the edges in the latter set are all active equals
\[ q^{M} (1 - q)^{m - M}. \]
Consider a node v ∈ V. There are σv edge-independent source-target paths that go through node v. This means that at least σv edges need to be inactive in order for there to be no active source-target path that goes through node v. Hence, the probability that there is no active source-target path that goes through node v can be written as
\[ \Phi_v = \sum_{i = \sigma_v}^{m} n_{vi} \, q^{i} (1 - q)^{m - i}, \]
where nvi > 0 for all i = σv, . . . , m. Note that the intermediacy of node v equals 1 minus this probability, that is, φv = 1 − Φv. Now consider two nodes u, v ∈ V with σu > σv. In the limit as p tends to 1, Φu and Φv both tend to 0. However, they do so at different rates. More specifically, in the limit as p tends to 1, we have
\[ \lim_{p \to 1} \frac{\Phi_u}{\Phi_v} = \lim_{q \to 0} \frac{n_{u \sigma_u}}{n_{v \sigma_v}} \, q^{\sigma_u - \sigma_v} = 0. \]
Hence, in the limit as p tends to 1, Φu < Φv, which implies that φu > φv.
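The opposite limit can be checked in the same way on another hypothetical graph, again assumed purely for illustration. Suppose node u is reached from s by the edge s → u and by the path s → a → u, and reaches t by the edge u → t and by the path u → b → t, so that σu = 2. Node v lies only on the path s → v → t, so that σv = 1. With q = 1 − p, a sketch of the calculation reads:

```latex
\begin{align*}
\Pr(X_{su}) = \Pr(X_{ut}) &= 1 - (1 - p)(1 - p^{2}) = 1 - q^{2}(2 - q), \\
\Phi_u = 1 - \phi_u &= 1 - \bigl(1 - q^{2}(2 - q)\bigr)^{2} = 4 q^{2} + O(q^{3}), \\
\Phi_v = 1 - \phi_v &= 1 - (1 - q)^{2} = 2 q - q^{2}, \\
\frac{\Phi_u}{\Phi_v} &= 2 q + O(q^{2}) \to 0 \quad \text{as } q \to 0.
\end{align*}
```

Here Φu is of order q² while Φv is of order q, so for p close to 1 node u, which lies on more edge-independent source-target paths, has the higher intermediacy.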
Proof of Theorem 3. Suppose that node w is located on a path from source s to node u. Let H denote the graph obtained after the path from node u to node v has been added, and let Euv denote the set of newly added edges. The intermediacy of node w in graph G can be factorized as φw(G) = Pr(Xsw | G) Pr(Xwt | G). Similarly, for graph H, we have φw(H) = Pr(Xsw | H) Pr(Xwt | H). Clearly, Pr(Xsw | G) = Pr(Xsw | H), since the paths from node s to node w are identical in graphs G and H. Furthermore, Pr(Xwt | G) = Pr(Xwt | H and ∀e ∈ Euv : Ae = 0). Since Pr(Xwt | H and ∀e ∈ Euv : Ae = 0) ≤ Pr(Xwt | H), it follows that Pr(Xwt | G) ≤ Pr(Xwt | H). This means that φw(G) ≤ φw(H).
An analogous proof can be given if node w is located on a path from node v to target t.
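The inequality can be illustrated on a hypothetical graph G, assumed here for illustration only, with edges s → w, w → u, u → t, s → v, and v → t. Let H be the graph obtained by adding the single edge u → v, which is the simplest possible added path from node u to node v. A sketch of the calculation for node w:

```latex
\begin{align*}
\phi_w(G) &= \Pr(X_{sw} \mid G) \Pr(X_{wt} \mid G) = p \cdot p^{2} = p^{3}, \\
\Pr(X_{wt} \mid H) &= p \bigl[ 1 - (1 - p)(1 - p^{2}) \bigr] = p \, (p + p^{2} - p^{3}), \\
\phi_w(H) &= p \cdot p \, (p + p^{2} - p^{3}) = p^{3} \bigl( 1 + p (1 - p) \bigr) \geq p^{3} = \phi_w(G).
\end{align*}
```

Conditioning on the added edge being inactive gives Pr(Xwt | H and Auv = 0) = p², which is exactly Pr(Xwt | G), in line with the step used in the proof.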
Proof of Theorem 4. Suppose that node w is located on a path from source s to node u. Let H denote the graph obtained after paths from node u to node v have been contracted, and let Euv denote the set of all edges between nodes in Vuv. The intermediacy [...]
An analogous proof can be given if node w is located on a path from node v to target t.
Algorithms. Intermediacy depends on the probability that there exists a path between two nodes in a graph. Determining this probability is known as the problem of network reliability. This problem is NP-hard (25). Below we provide an outline of an exact algorithm for calculating intermediacy. Because of its exponential runtime, the exact algorithm can be used only in relatively small graphs. We therefore also propose a Monte Carlo algorithm that approximates intermediacy.
Exact algorithm. The exact algorithm, illustrated in Fig. 6A, is based on contraction and deletion of edges (26). Suppose we have a graph G = (V, E). The probability that there exists a path between two nodes u, v ∈ V can be written as
\[ \Pr(X_{uv} \mid G) = p \Pr(X_{uv} \mid G/e) + (1 - p) \Pr(X_{uv} \mid G - e), \tag{8} \]
where G/e denotes the contraction of an edge e ∈ E and G − e denotes the deletion of an edge e ∈ E. Edge contraction must respect reachability (27). Eq. 8 yields a recursive algorithm for calculating Pr(Xuv). For a node v ∈ V, this algorithm can be used to calculate Pr(Xsv) and Pr(Xvt). The intermediacy φv of node v is then given by Eq. 1. We are usually interested in calculating the intermediacy of all nodes in a graph G, not just of one specific node. This can be performed efficiently by calculating Pr(Xsv) and Pr(Xvt) for all nodes v ∈ V in a single recursion.
The runtime of the exact algorithm is exponential in the number of edges m. The algorithm has a complexity of O(2^m). In the special case of a so-called series-parallel graph, the runtime of the algorithm can be reduced from exponential to polynomial (28).
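The recursion behind the exact algorithm can be illustrated with a minimal Python sketch. It is not the implementation used in this paper. Instead of contracting edges explicitly, the sketch conditions on each edge being active (probability p) or inactive (probability 1 − p) and records the active edges, so it enumerates all 2^m edge states. It then uses the fact that the intermediacy of a node equals the probability that the node lies on an active source-target path. The function names `reachable` and `exact_intermediacy` and the example graph are assumptions made for illustration.

```python
def reachable(edges, start):
    """Return the set of nodes reachable from `start` along directed `edges`."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def exact_intermediacy(edges, s, t, p):
    """Exact intermediacy of every node; exponential in the number of edges."""
    nodes = {x for edge in edges for x in edge} | {s, t}
    phi = dict.fromkeys(nodes, 0.0)

    def recurse(i, active, prob):
        if i == len(edges):
            # All edges decided: nodes reachable from s that also reach t
            # lie on an active source-target path in this edge state.
            from_s = reachable(active, s)
            to_t = reachable([(b, a) for a, b in active], t)
            for v in from_s & to_t:
                phi[v] += prob
            return
        recurse(i + 1, active + [edges[i]], prob * p)   # edge i is active
        recurse(i + 1, active, prob * (1.0 - p))        # edge i is inactive

    recurse(0, [], 1.0)
    return phi


# Hypothetical example graph with two source-target paths.
example_edges = [("s", "u"), ("u", "t"), ("s", "a"), ("a", "v"), ("v", "t")]
example_phi = exact_intermediacy(example_edges, "s", "t", 0.5)
for node in sorted(example_phi):
    print(node, example_phi[node])
```

Because the sketch visits every edge state, it is practical only for graphs with a few dozen edges at most, which matches the exponential runtime noted above.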
Monte Carlo algorithm. The Monte Carlo algorithm, illustrated in Fig. 6B, is quite straightforward. Suppose we have a graph G = (V, E) and we are interested in the intermediacy φv of a node v ∈ V. A subgraph H can be obtained by sampling the edges in the graph G, where each edge e ∈ E is sampled with probability p. Given a subgraph H, it can be determined whether in this subgraph node v is located on a path from source s to target t. We sample N subgraphs H1, . . . , HN. We then approximate the intermediacy of node v by
\[ \phi_v \approx \frac{1}{N} \sum_{i = 1}^{N} I_{st}(v \mid H_i), \]
where Ist(v | Hi) equals 1 if there exists a path from source s to target t going through node v in graph Hi and 0 otherwise. The Monte Carlo algorithm can be implemented efficiently by simultaneously sampling subgraphs and checking path existence. To do so, we perform a probabilistic depth-first search. We maintain a stack of nodes that still need to be visited. We start by pushing source s to the stack. We then keep popping nodes from the stack until the stack is empty. When a node v has been popped from the stack, we determine for each of its outgoing edges whether the edge is active. An edge is active with probability p. If an edge (v, u) is active and node u has not been visited before, then node u is pushed to the stack. At some point, target t may be reached, resulting in the identification of nodes that are located on a path from source s to target t. This implementation of the Monte Carlo algorithm is especially fast for smaller values of the probability p. For a fixed number of samples N, the runtime of the Monte Carlo algorithm is linear in the number of edges m.
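The probabilistic depth-first search can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Java implementation used in this paper. The function name `monte_carlo_intermediacy`, the `seed` parameter, and the backward walk from the target used to collect the nodes on active source-target paths are choices made for this sketch. Each outgoing edge of a visited node is sampled exactly once per subgraph, and edges that cannot be reached from the source are never sampled.

```python
import random


def monte_carlo_intermediacy(edges, s, t, p, n_samples, seed=0):
    """Estimate the intermediacy of every node from n_samples sampled subgraphs."""
    rng = random.Random(seed)
    out_neighbors = {}
    for a, b in edges:
        out_neighbors.setdefault(a, []).append(b)
    nodes = {x for edge in edges for x in edge} | {s, t}
    counts = dict.fromkeys(nodes, 0)

    for _ in range(n_samples):
        # Probabilistic depth-first search from the source.
        visited = {s}
        stack = [s]
        active_in = {}  # node -> predecessors connected by an active edge
        while stack:
            v = stack.pop()
            for u in out_neighbors.get(v, []):
                if rng.random() < p:  # edge (v, u) is active
                    active_in.setdefault(u, []).append(v)
                    if u not in visited:
                        visited.add(u)
                        stack.append(u)
        if t not in visited:
            continue  # no active source-target path in this sample

        # Walk backwards from the target along the sampled active edges.
        on_path = {t}
        stack = [t]
        while stack:
            u = stack.pop()
            for v in active_in.get(u, []):
                if v not in on_path:
                    on_path.add(v)
                    stack.append(v)
        for v in on_path:
            counts[v] += 1

    return {v: c / n_samples for v, c in counts.items()}


# Hypothetical example graph with two source-target paths.
example_edges = [("s", "u"), ("u", "t"), ("s", "a"), ("a", "v"), ("v", "t")]
estimate = monte_carlo_intermediacy(example_edges, "s", "t", 0.5, 20000)
for node in sorted(estimate):
    print(node, round(estimate[node], 3))
```

Every predecessor recorded in `active_in` was itself reached from the source, so the backward walk from the target returns exactly the nodes that lie on an active source-target path in the sampled subgraph.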
Source code. In this paper, we use a Java implementation of the Monte Carlo algorithm. The source code is available at https://github.com/lovre/intermediacy (29).