Pyntacle: a parallel computing-enabled framework for large-scale network biology analysis

Abstract

Background: Some natural systems are large, complex, and often characterized by convoluted mechanisms of interaction, such as epistasis, pleiotropy, and trophism, which cannot be immediately ascribed to individual natural events or biological entities but often derive from group effects. The determination of important groups of entities, such as genes or proteins, in complex systems is, however, a computationally hard task. Results: We present Pyntacle, a high-performance framework designed to exploit parallel computing and graph theory to efficiently identify critical groups in big networks and in scenarios that cannot be tackled with traditional network analysis approaches. Conclusions: We showcase potential applications of Pyntacle with transcriptomics and structural biology data, thereby highlighting the outstanding improvement in terms of computational resources over existing tools.

> We thank the reviewer for the suggested reads and have integrated them into the revised manuscript. As far as the proposed benchmark is concerned, we think it would be very informative and useful. However, we prefer to postpone it for two reasons: (1) as the reviewer has noticed, Pyntacle is currently equipped with just one heuristic method, and we think the suggested effort will be much more effective once Pyntacle runs several more algorithms; this will happen with the next major release, scheduled for the second half of next year. (2) We are aware that the benchmarks presented are preliminary, and we struggled with whether to present them at all. We would prefer to keep them at this preliminary level or, if the reviewers suggest it, to remove them completely.
A.1.2. Insufficient references to existing software. Although a review of existing network libraries is beyond the scope of the manuscript, it could be useful for readers to at least mention some other general-purpose graph libraries usable from Python, such as graph-tool, igraph, NetworkX, and SNAP.
(See Section C below for more).
Most particularly, it is noted that Pyntacle uses igraph modules, yet the igraph paper is not cited.
In addition, although the Analytic Technologies KeyPlayer program and the R keyplayer package (An & Liu, 2016) are mentioned, the influenceR R package is not. Although this package only solves KPP-Pos with the DF metric, as I show in Section B below (on a single example) it does so much faster than Pyntacle.
> We integrated the references in the revised manuscript.
For Case Study 1, I was easily able to reproduce the 3D protein cartoon in Fig. 6 by downloading the 5XTD structure from the RCSB PDB into PyMOL (Schrödinger, LLC), selecting the appropriate two chains, and locating the key player residues as shown. However, to obtain the network of "The 138 residues forming the interaction interface" I would (according to the Methods) have to conduct several operations in FoldX and then extract the network according to a distance threshold described in the Methods. Although it seems there is sufficient detail to reproduce this, it requires registration to use FoldX and some further work to construct the network.
I think it would assist reproducibility (an important consideration in the GigaScience journal) to supply the network as constructed by the authors as a Supplementary Information file (as for example the results of the Pyntacle analysis are already supplied in an Excel file).
Similarly, the network constructed for Case Study 2 could also be made available as Supplementary Information.

> We have stored the adjacency matrices of the two case studies in GigaDB. N.B., a member of the GigaScience editorial team confirmed that a link to these files will be made available after the publication of this piece of research. Moreover, we have explained how we built them in order to improve replicability. In Case Study 1, the 138 residues identified via FoldX are those considered at the interface, for which energetic contributions were calculated. The network we built was obtained by directly analyzing the PDB file of the complex, connecting residues of different chains and relying only on distance constraints (therefore avoiding external software like FoldX). As we demonstrate in the case study, valuable information is gained nonetheless, without external software and in much less time.
The manuscript is described (including in the title) as being for "large-scale" network analysis, and parallelism is emphasized. However (related to point A.1.1 above) it seems to me that the emphasis is perhaps misplaced. To be specific, two algorithms are implemented and benchmarked: the "brute-force" algorithm, and the greedy heuristic algorithm of Borgatti (2006). Parallelism then mainly seems to be discussed in relation to the brute-force algorithm.
However, as the KPP is known to be computationally intractable and indeed NP-complete, a brute-force algorithm is never going to be practical, and certainly not for large-scale instances. For this reason there are two practical approaches: (a) exact solution by more sophisticated methods (for example integer linear programming (ILP), semidefinite programming, branch-and-cut methods, etc.), and (b) heuristic algorithms that find a potentially suboptimal solution much faster. Note that the latter are also vital as input into the former (as bounds on the solution). For a review of these approaches see the cited literature.
Hence my concern is that implementing and applying parallel computation to the brute-force approach is misplaced, as it can never be competitive with sophisticated exact solution algorithms (particularly where these could also be parallelized) on large-scale problems.
Regarding the heuristic (greedy) algorithm, it also seems that better results could be obtained with a more sophisticated heuristic, as already foreshadowed by Borgatti (2006). For example, I demonstrate in Section B below that the influenceR program is two orders of magnitude faster (albeit on a single example) than Pyntacle for KPP-Pos, as it uses a different heuristic algorithm. In addition, there are many publications showing superior results with different heuristic algorithms for CNDP variants (see the cited literature for a review).
> BF was implemented because it is exact, not to compete with any other search algorithm. We added some parallelism to it because some fields of research, like ecology, deal with small/medium-sized networks, in the order of a few tens or hundreds of nodes. Why should we yield suboptimal results for these when an exact solution is achievable? We were not clear enough on this: parallelism was applied to BF (coarse + fine grain) as well as to the heuristics (fine grain only). Fine-grained parallelism applies directly to the way we enumerate shortest paths and, through this feature, Pyntacle reduces the computational burden of elaborating big matrices/graphs. This happens regardless of whether one uses the BF or GO search strategies. However, we agree that Pyntacle could greatly benefit from more sophisticated algorithms, and the implementation of a range of different search and optimization algorithms (e.g., particle swarm, ant colony, stochastic gradient descent) that the user can select from is in our plans.
From my understanding of the manuscript (and some experiments I tried with it, detailed in Section B below), there are two types of parallelism used in Pyntacle. First, the "fine-grained" parallelism it obtains simply from using the Numba library, and second, explicit "coarse-grained" parallelism. The first type uses multithreading in lower-level numeric routines, similar to Microsoft R Open using multithreaded linear algebra routines. The second type is used only for the brute-force algorithm, however (and on Linux seems to work by forking new processes rather than using threads).
> This reviewer is almost correct. Multithreading is exploited by Numba in the calculation of all shortest paths (Floyd-Warshall) and only for this task; we do not use any sort of off-the-shelf linear algebra library. Numba allows us to release the global interpreter lock when computing a (not so) restricted set of manipulation tasks on NumPy arrays. With the GIL released, we can spawn threads, handle concurrency, and therefore parallelize one for-loop of the Floyd-Warshall computation. All metrics based on the calculation of shortest paths benefit from this. As a side note, Pyntacle has three implementations of the all-shortest-paths computation: the one borrowed from igraph (sequential); one implemented by us with Numba (multithreaded); and one implemented by us with Numba (GPU). The brute-force search, on the other hand, relies on the multiprocess Python package and, as correctly noted, forks processes rather than threads. It is used exclusively there because the brute-force search knows the exact number of groups to be computed in advance. Although possible, we discourage the mixed use of processes and threads.
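To make the discussion above concrete, here is a minimal NumPy sketch of the Floyd-Warshall all-pairs shortest-paths computation. This is not Pyntacle's actual kernel (which is Numba-compiled); the vectorized update inside the loop is the kind of per-k work that a Numba prange loop would distribute across threads once the GIL is released. All names are illustrative.

```python
import numpy as np

def floyd_warshall(adj):
    """All-pairs shortest paths on a weighted adjacency matrix.
    Zero off-diagonal entries mean 'no edge'. Minimal sketch only;
    Pyntacle's Numba kernels parallelize these updates instead."""
    n = adj.shape[0]
    dist = np.where(adj > 0, adj.astype(float), np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        # relax every pair (i, j) through intermediate node k; this
        # vectorized update is the natural unit of fine-grained parallelism
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist
```

All shortest-path-based metrics (betweenness, closeness, reachability) can then be derived from the resulting distance matrix, which is why parallelizing this one kernel benefits the whole metric suite.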
It is not entirely clear, but it seems to me that Figure 4 shows results for using the greedy heuristic with a single thread (no parallelism). Thus it compares the same heuristic algorithm implemented in Pyntacle and R/keyplayer, showing that Pyntacle is faster (due to implementation details, being in Python rather than R, etc.). This is a fair comparison if both implementations are really using only a single thread. However, I found that Pyntacle used multithreading (up to the number of cores on the node it is run on) automatically, unless the NUMBA_NUM_THREADS environment variable is set (I am not familiar with the Numba library and had to find this in its online documentation). I could not find any documentation in the Pyntacle online manuals about parallelism or adjusting the number of threads, except for the -T/--threads option, which applies to the brute-force algorithm only (not the greedy heuristic). The keyplayer R package, however, uses a single thread unless parallelism is explicitly requested via a function argument.
Hence I think this should be clarified in the manuscript, and some documentation of how parallelism works and how to adjust the number of threads added to the Pyntacle documentation.
> Since the command-line interface of Pyntacle was designed for non-experts, we decided to hide one level of parallelism (the fine-grained one) from the user and to manage automatically the parallelism related to the calculation of the shortest paths. We currently use the following simple algorithm, which we will make smarter and adaptive in future releases:

    // auto-select the computing mode
    if nprocs > 1                                  // n.b., nprocs is user-defined
        enable multi-process, disable multi-threading
    else if size(graph) < 250 or rho(graph) < 0.5  // rho measures sparseness
        disable multi-process, disable multi-threading
    else
        disable multi-process, enable multi-threading

N.B., we have replaced -T/--threads with -O/--nprocs, since it actually refers to the coarse-grained parallelism, which forks processes and not threads.
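A direct Python transcription of the selection logic described above, for illustration only (the function name is hypothetical; the thresholds are the ones quoted in the response):

```python
def select_computing_mode(nprocs, n_nodes, rho):
    """Sketch of the mode auto-selection rule (name hypothetical).
    rho is the graph density, used here as a proxy for sparseness."""
    if nprocs > 1:                    # user explicitly asked for processes
        return "multiprocess"
    if n_nodes < 250 or rho < 0.5:    # small or sparse: serial is cheapest
        return "sequential"
    return "multithreaded"            # big and dense: thread the SP kernel
```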
The number of threads of the fine-grained parallelism is internally set to the number of available CPUs minus 1 by default, and can be tuned by setting the NUMBA_NUM_THREADS environment variable, as correctly pointed out. However, some caution is needed here: Numba adjusts the number of active threads on the fly according to the current overheads and, hence, the efficiency of the parallelism. This means that the value specified in the environment variable might not actually be respected. This is why we were not explicit about this aspect: we did not want to cause confusion and a frustrating experience for non-expert users. We have made this clear in the online documentation.
Multithreading is, however, controllable via the APIs. To make the comparisons fair, we benchmarked naked functions directly.

Fig. 5, however, shows the speedup obtained by Pyntacle using the brute-force algorithm with explicit (coarse-grained) parallelism as described, computing the metric on the independent possible node sets in parallel. This is a valid comparison; however, as per A.1.4 above, I question the practicality of the brute-force method at all, given the availability of many more efficient (albeit more complicated, and perhaps requiring a sophisticated ILP solver, for example) algorithms for exact solution.
> As in point A.1.4, we agree that there is a broad spectrum of heuristic search algorithms and optimization techniques better than the one currently implemented in Pyntacle, and it is our goal to integrate them in the next releases. In GitHub, we have already created a branch ("feat#31") where we have implemented the stochastic gradient descent algorithm. This branch has not been merged yet because the new implementation has to be carefully tested and unit tests still need to be written.
The manuscript mentions graphics processing units (GPU) for parallelism, but I found no further details on this. When I run Pyntacle on a node with no GPU, I get a warning message that "NVIDIA: no NVIDIA devices found" which would seem to indicate that NVIDIA (CUDA) GPU support is available, and when I run the pyntacle keyplayer module on a cluster node which does have a GPU, I do not get the warning message, but the GPU does not seem to be used (checking with the nvidia-smi program).
Note that I installed the 'miniconda' package on a cluster head node which does not have a GPU (details in Section B below).

> This reviewer is correct. GPU-based processing is an experimental feature at the moment and, as shown in the algorithm above, is not covered by the command line. This is because of erratic behaviors of Numba on some hardware that might puzzle the user. The GPU feature will be stable in release 2.0, once we have covered the possibility of managing big matrices, for which replacing fine-grained parallelism with GPU computing would make sense. We have clearly specified this in the documentation. However, GPU computing is currently possible via the APIs:

    BruteforceSearch.reachability(graph_rnd, 2, KpposEnum.mreach, None,
                                  m=2, cmode=CmodeEnum.gpu, nprocs=1)

This command, applied to a 4,000-node network, makes Pyntacle use just one third of the available GPU capabilities, with a non-negligible overhead. For this reason too, we will fully include the GPU option when Pyntacle is able to handle more than 10,000 nodes efficiently.

A.2. Minor issues
The variable F is first mentioned in Case Study 1 (p. 7 "... the best negative keyplayers in the network (according to both F and DF metrics), ...") but not defined, although it presumably refers to the fragmentation centrality of Borgatti (2006, equation (4)). However, it is then used as a variable representing any centrality function (key player metric) in Algorithms 1 and 2 (p. 9), which could be confusing.

> Sorry for this. We have replaced F with FUNC in the algorithms and have defined Borgatti's F metric in the Methods. Thanks.
At the bottom of p. 5 (column 1): "... keyplayer is not rigorously testable here since it is a Windows-only GUI-based application." This could be confusing, as the R keyplayer package (An & Liu, 2016) is previously styled as "keyplayer" (lowercase) while the Analytic Technologies KeyPlayer program (Borgatti, 2019) is styled as "KeyPlayer" (camel case), and it is the Analytic Technologies KeyPlayer program that is Windows-only.

The Pyntacle website referenced in the manuscript (http://pyntacle.css-mendel.it) is easy to navigate and read, and contains extensive documentation on installation and usage, as well as, importantly, examples and case studies which can be followed and reproduced easily.
I installed the Pyntacle software on a Linux server (the master [head] node of the cluster described below, which has 2 x Intel E5-2630 v3 CPUs, 16 (2 x 8) cores, and 64 GB DDR4 @ 2133 MHz RAM) using the 'miniconda' system, as recommended on the Pyntacle home page. This worked without problem, and I ran the tests, which passed. I was also easily able to reproduce Case Study 3 (C. elegans connectome) linked from the software home page.
I then ran some of my own tests (using the command line interface, running as batch jobs on cluster compute nodes) to compare results between Pyntacle, the keyplayer R package (v1.0.3) (An & Liu, 2016), and (for KPP-Pos only) the influenceR R package (v0.1.0).
I came across one bug in the command line interface: the --type m-reach option does not seem to work. But I found this already reported in the GitHub issue tracker as issue #23.

> Amended
Note I ran the tests once only, so the times (which are elapsed times of the key player finding module only, as reported by Pyntacle itself, or system.time() for R packages) are merely indicative. For parallel tests I reserved all 20 cores on a compute node. For serial tests I reserved a single core on a compute node. I set environment variable NUMBA_NUM_THREADS for Pyntacle and OMP_NUM_THREADS for R packages using OpenMP, and the parallel flag and cluster option for the keyplayer R package, to the appropriate number of cores (20 or 1) to ensure the correct number of cores is (potentially) used.
The network data set used is the CAEEL_Connectome.sif file from "Case Study 3" on the Pyntacle home page. I used the "pyntacle convert" command to convert it from SIF format to edge-list format for use with the R keyplayer package. Note that the influenceR package only does KPP-Pos, so it is not included in this test. This shows Pyntacle computing the KPP-Neg (DF metric) correctly on this test, two orders of magnitude faster than the R keyplayer software. There is no real difference in the time for Pyntacle on 1 core versus 20 cores, perhaps unimportant when it takes less than 20 seconds anyway. I am not sure why the keyplayer R package is so much slower with 20 cores than 1 core, but I suspect it is because the parallel mode in that package is for trying to avoid local minima by running the algorithm with different randomization seeds in parallel (An & Liu 2016, p. 264), rather than speeding up the computation of a single run.

Method
In running these tests I found a bug in Pyntacle-Ink (the HTML graphical output): for KPP-Neg metric dF it reports both "Initial Value" and "Value" but they appear to be swapped (they are the right way around in the stdout and the Report TSV file however).

> Amended
Test 2: Key players DR (KPP-Pos) of size k = 2 in the C. elegans PPI network (3303 nodes), computed with greedy heuristic. (This should be the same test as the benchmark shown in Fig. 4 of the manuscript for (D)).
The network data used here is the APID_CAEEL_Level2_maincomponent.adjm adjacency matrix file as used in "Case Study 2" linked from the Pyntacle software home page, and referred to as "APID" in the manuscript. The result is consistent with Fig. 4, so this seems reasonable. And again, much faster than the keyplayer R package (which could not find a solution within 20 hours).

Method
Again, however, the Pyntacle key player module using the greedy heuristic with 20 cores has only a relatively small advantage over a single core, even though the elapsed time is now several minutes. The Linux (tcsh) time builtin shows 99.8% CPU for the serial run and only 109.0% CPU for the parallel run, showing that little parallelism was used.
> The reason for that might be ascribed to the fact that Numba reduces parallelism when overheads become significant.
However, note that the influenceR R package, which uses a different heuristic algorithm, "stochastic gradient descent" (Jacobs et al., 2015, "keyplayer"), is able to find the same solution as Pyntacle, but two orders of magnitude faster. (Note it also finds a different solution in the serial run; since it is a stochastic algorithm we could get different values each run, but unfortunately it does not seem to report the objective value, so we cannot see whether the solution is better or worse.)
> This note is weak, as pointed out by the reviewer him/herself, also because SGD explores a smaller solution space than the greedy-optimization (GO) algorithm implemented in Pyntacle. In particular, the stochastic gradient descent (SGD) search works as follows:

1. Select k nodes at random to populate the set S.
2. Set F = fit, using the appropriate key player metric (KPP-Pos in our case).
3. Get a new state:
   ○ Pick a random u in S and a random v not in S.
   ○ F' = fit if u and v were swapped.
   ○ If F' > F, swap u and v in S; else, repeat step 3. (Alternatively, if a positive value is given for the 'prob' parameter, a swap will be accepted with a small probability regardless of whether it improves the fit.)
4. If F' - F < tolerance, or the maximum computation time is exceeded, return S; else, go to step 3.

SGD adds a random contribution that can supersede the fitting function and skips the exhaustive swap loop of the GO algorithm. This increases the chance of falling into disadvantageous local maxima, much more than with the GO algorithm. Even without any deep analysis, which is beyond this scope, it is intuitive why the former is faster than the latter, just as it is clear that its chances of finding the best solution are lower than those of the GO algorithm.
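The swap-search steps above can be sketched in a few lines of Python. This is a hedged illustration, not influenceR's or Pyntacle's actual implementation: the `fit` callable, parameter names, and stopping rule (a fixed iteration budget rather than a tolerance/time limit) are all assumptions.

```python
import random

def sgd_keyplayer(nodes, fit, k, max_iter=1000, prob=0.0, seed=None):
    """Stochastic swap search for a size-k key player set (illustrative).
    `fit` scores a node set (higher is better); `prob` optionally accepts
    worsening swaps, as in the 'prob' parameter described above."""
    rng = random.Random(seed)
    s = set(rng.sample(sorted(nodes), k))
    score = fit(s)
    for _ in range(max_iter):
        u = rng.choice(sorted(s))                # member to drop
        v = rng.choice(sorted(set(nodes) - s))   # outsider to add
        cand = (s - {u}) | {v}
        f = fit(cand)
        if f > score or rng.random() < prob:     # greedy accept (or random)
            s, score = cand, f
    return s, score
```

With `prob=0` this reduces to a pure hill-climbing swap search, which makes the local-maxima concern in the response above easy to see: once no single swap improves the fit, the search stops moving.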
Note also that influenceR (despite the short elapsed time) also seems to be making good use of multicore (20 cores): the (Linux) time command shows similar user and elapsed time for the serial run (as expected), but the parallel run shows 26.137 user CPU seconds and only 1.208 elapsed seconds (and only 0.016 system).
Note I also tried to find the key player set of size k = 2 for the DF metric (KPP-Neg) with the greedy heuristic in Pyntacle on this data set, but it was unable to complete within the time limit of 20 hours (serial or parallel).
> Pyntacle will do better with more efficient search algorithms in the next releases.
This list is not necessarily exhaustive, but covers some packages that I have used or am aware of, particularly those usable from Python. They are all general-purpose graph libraries, with the exception of influenceR and keyplayer, which are specifically for the key player problem. Some have many more features than others, though; for example, SNAP (Leskovec) handles very large networks efficiently, but has very few algorithms implemented compared to igraph or NetworkX.

Reviewer #2: In this article, the authors describe Pyntacle, a Python-based software tool for the analysis of networks. Specifically, they focus on identifying groups of nodes by performing a combinatorial evaluation of various parameters/indices/centrality measures. The manuscript provides a good outline of the package and its capabilities. Their approach is explained well; I found Figure 1 to be a good toy example to illustrate the capabilities of the software. I have a few concerns for the authors to comment upon/implement to further improve their manuscript.
Specific comments: 1. I find the networks in Figure 5 to be very small and not representative of biological networks in complexity, size, and density. It would be nice if the examples were changed to 10^3 and 10^4 nodes, discarding the 100-node ones. Or will the combinatorial complexity become a major problem for larger networks?
> As we have shown in the case study with the network derived from a protein-protein interaction interface, the target of Pyntacle also includes small networks (<100 nodes/edges); we therefore wanted to present the different scenarios in which the potential user may find Pyntacle useful, e.g., residue contact or H-bond networks in a complex during a molecular dynamics simulation. Bigger networks (10^3 or 10^4 nodes) would not display the same level of informativeness in a figure, given their sizes; moreover, they would not represent a fair comparison ground with the other methods, since networks in the 10^3 category already took more than one day of computation, revealing the speedup gained when using Pyntacle. Finally, 10^4-node networks are indeed combinatorially complex to analyze; they will be properly dealt with in the next release of Pyntacle, with a stable GPU implementation and faster heuristic search algorithms and optimization approaches to choose from (as pointed out by reviewer 1 as well).
2. Perhaps, it would be nice if another example, of a protein-protein functional association network from the STRING database is considered. It would be a biologically motivating example, given the opening discussions about protein interaction networks.
> The C. elegans network, collected from APID and used for benchmarking and tests in the manuscript, is built from quality-checked protein-protein interaction data. We have clarified this in the text.
> The networks used in the two case studies can now be accessed from GigaDB.
5. An example case study with a larger network, such as E. coli from STRING would also add a lot of value to the paper.
> The STRING network for C. elegans is one order of magnitude bigger than the E. coli network from the same resource: the STRING C. elegans network counts 18,181 nodes (6,170 when considering only high-quality interactions with a combined score of 900 or higher), while the STRING E. coli K12 W3110/MG1655 networks count 4,210/4,125 nodes respectively (which decrease to 2,698/3,259 nodes when applying the same high-quality filter). The C. elegans network we used (collected from the APID resource) counts 3,303 nodes, the same order of magnitude as the STRING networks, and is therefore a suitable representative of this kind of PPI network. The number of edges varies greatly with the quality of each reported interaction.
Optional features: 1. Lastly, it would be nice if there is a way to pull networks from commonly used databases such as networkrepository.com or STRING. This would enable a lot of network scientists and biologists to easily perform various network analyses.
> This is a great option to add in the immediate next release of Pyntacle: the user would be given the choice of selecting among the most used repositories (e.g., STRING, IMEx, and APID).

Background
Interactive systems are commonly represented as graphs (or networks), which are mathematical representations of elements (nodes) and their relationships (edges). The semantics of relationships is specific to each graph and completely defines its expressiveness. Protein interaction networks, for example, represent physical interactions as edges and proteins as nodes; metabolic networks wire metabolites if these participate in the same biochemical reactions; regulatory networks are directed graphs, where the directionality of relationships matters: a link exists between two molecules if there is evidence either of regulatory activity by a transcription factor onto a gene or of post-translational modifications. These, together with several other kinds of networks, like RNA, signaling, neuronal, trophic, and co-expression networks, are the concrete signs of an exceptional growth of molecular interaction data and, hence, of an intense research activity in the field of Network medicine [1].
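As a minimal illustration of these two representations, an undirected interaction network and a directed regulatory network can both be encoded as adjacency structures (the protein names and edges below are purely illustrative, not curated interaction data):

```python
# undirected PPI network: each edge is a physical interaction
ppi_edges = [("TP53", "MDM2"), ("MDM2", "UBE3A"), ("TP53", "EP300")]
ppi = {}
for a, b in ppi_edges:
    ppi.setdefault(a, set()).add(b)
    ppi.setdefault(b, set()).add(a)  # symmetric: undirected

# directed regulatory network: edge direction encodes regulator -> target
reg = {"TP53": {"CDKN1A"}, "EP300": {"TP53"}}

print(sorted(ppi["TP53"]))  # neighbors of TP53 in the PPI graph
```

The same data could of course be loaded into igraph or NetworkX; the point is only that edge semantics (symmetric interaction vs. directed regulation) dictate the graph type.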
Network medicine is a relatively new discipline that exploits graph theory to identify key molecules in the human diseasome [2], together with their hidden molecular relationships. The general aim is that of reverse-engineering the mechanisms of pathogenesis of complex disorders and traits, whose etiology is notoriously convoluted. The diseasome is, in fact, a network where diseases are nodes and links represent relationships between the disease-associated cellular components. Determining such links would help identify the molecular relationships between phenotypes and the reasons for certain comorbidities, and would positively affect diagnosis, treatment, and drug repurposing.
Compiled on: July 10, 2020. Draft manuscript prepared by the author.

Key Points

• Understanding the architecture of networks, as well as the key roles of their components, is a process that historically relies on the calculation of sets of global and local topological indices.
• The information provided by local topological indices is scant or meaningless for natural networks of big sizes or for those that model complex events, like epistasis, pleiotropy, and trophism, which in fact result from the interleaving actions of multiple natural actors.
• Pyntacle provides the user with an array of computationally efficient algorithms to manage networks and to search for and find important groups of nodes.

Certain kinds of biological networks share the feature of having relatively few highly connected nodes, often called hubs, suggesting that the molecules represented by hubs play special biological roles. The first hypothesis of Network medicine is that the vast majority of known disease genes, which are non-essential, lie in the periphery of these networks, far from hubs. On the contrary, at least in human cells, hub molecules are encoded by essential genes [3]. A database, DEG, exists that reports essential genes for some bacteria, archaea, and eukaryotes [4]. An interesting speculation is that, because of their many links, hubs are reasonably associated with disease genes [5,6,7], which in turn, by virtue of the local hypothesis of Network medicine, exhibit an increased tendency to interact with each other, as they are all involved in the same disease. Thus, molecular networks are not random, but tightly organized according to specific principles, whereby the effect of a central, possibly aberrant, gene reverberates on the gene products of neighboring genes in its network. Hence, the expression of a disease phenotype rarely results from an individual aberrant gene, but rather from the harmonized effects of groups of related genes. This holds true also for other types of networks, ranging from ecological to evolutionary and chemical networks.
Graph theory draws upon various tools to identify the most central elements, i.e., the key molecules, in a network. Here, the concept of centrality is synonymous with importance, even if it has been interpreted differently in the literature. A topologically important node may be a hub, a bottleneck (namely, a node that lies in many pathways), or one that is "close" to most other nodes. But, as argued above, local (i.e., regarding nodes or edges) and global (i.e., regarding the entire network) properties of networks are unlikely to completely explain the functioning of complex systems, since they do not take into account, or underestimate, the effects that groups of important nodes may jointly exert on these systems. Node 2 in Figure 1 has 7 ties and is the highest-degree node in this example network. It is connected with 7 unimportant nodes, since these exhibit low degree values. Node 9 is the second most connected node, with only one edge fewer than node 2, but two of its neighbors, i.e., nodes 16 and 21, are the third- and fourth-ranked nodes by degree, with 5 and 4 ties, respectively. Thus, although node 2 is top ranked by degree, it may not be the most functionally central node. Whether this assertion is true or not strictly depends on the purposes for which a network is being studied.
More interestingly, the network parsimony principle of Network medicine, according to which causal molecular pathways often coincide with the shortest molecular paths between known disease-associated components, implies that it is fundamental to find the nodes that lie within the highest number of pathways in a network, since these are more likely to be functionally critical [1]. The betweenness centrality index [8] is the most suitable for this task. Node 21 in Figure 1 is the top-ranked node by betweenness. This was expected, since it lies in the exact middle of the network, which in turn exhibits a quasi-tree topological node organization. But even if node 21 belongs to almost all shortest paths of the network, it ranks only fourth by degree, since it is not individually much connected. Node 9 is the second node by betweenness, with a score very close to that of node 21, but, on the contrary, it ranks second by degree (Supplementary data S1). Whether the most important node is 2, 21, or 9 depends on the aims and context of the study.
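As a sketch of how such rankings are obtained, unnormalized betweenness centrality on a small unweighted, undirected graph can be computed with Brandes-style BFS path counting. This is a pure-Python illustration only; Pyntacle delegates the actual computation to igraph or its Numba kernels.

```python
from collections import deque

def betweenness(adj):
    """Betweenness via BFS shortest-path counting (Brandes' algorithm),
    on an unweighted, undirected graph given as {node: set(neighbors)}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # single-source shortest paths with path counts (sigma)
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        preds = {v: [] for v in adj}
        q = deque([s])
        while q:
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate pair dependencies along the BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # undirected graph: each node pair was counted twice
    return {v: b / 2 for v, b in bc.items()}
```

On a three-node path 0-1-2, only the middle node mediates a shortest path, so it alone receives a nonzero score, mirroring the intuition about node 21 above.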
Whenever more than one node exhibits similar topological scores, as in this case, or when a co-responsibility for a phenotype is suspected, studying groups and their centrality might be a reasonable option. In 1999, Everett and Borgatti expanded the definition of degree, betweenness and closeness to groups of nodes [9]. Calculating these indices for the groups {2, 21}, {9, 21} and {2, 9}, the latter achieved the highest scores. Moreover, considering the group made of all three nodes, only degree and closeness increased significantly with respect to {2, 9} (cf. Table 1).
In 2006, Borgatti introduced two other classes of metrics for groups, meant to assess the ability of groups either to disrupt a network, when removed, or to efficiently spread information through a network. These were defined as the "Key-Player Problem/Negative" (KPP-Neg) and the "Key-Player Problem/Positive" (KPP-Pos), respectively [10]. N.B. similar concepts were also covered in other research fields and scientific contexts [11,12], where specific search strategies [13,14] were implemented. KPP-Neg and KPP-Pos were calculated for the same groups and reported in Table 2. It is interesting to notice that {2, 9} is still the most important group in terms of disruption potential and connectivity. Its scores were slightly lower than those of group {2, 9, 11}, meaning that even here node 11 does not contribute significantly to the centrality of {2, 9}. Table 2. KPP-Neg and KPP-Pos metrics calculated for the example network. DF (Neg) achieves its maximum value of 1.0 when the graph consists entirely of isolated nodes. M-reach (Pos) is a count of the number of unique nodes reached by any member of the group in m links or less. DR (Pos) achieves a maximum value of 1 when every non-group node is adjacent to at least one member of the group. What remains to be verified is whether any other group exists that exhibits similar or higher centrality values. Considering the small network size, the option of running a brute-force algorithm to search for the absolute best group(s) among all possible ones is computationally feasible, in place of a greedy-optimization search, as suggested by Borgatti in [10]. In this case, the best group of size 2 for all metrics is still {2, 9}, whereas {5, 9} reaches 100% of non-group nodes and ranked first by m-reach. However, since none of the centrality scores of {2, 9} equaled their maximum possible values, we applied the brute-force search again to groups of increasing sizes, 3 through 6.
We thus found that degree and closeness reached their absolute maximum scores, i.e. 1, equally with two groups of size 5, {2, 9, 11, 16, 21} and {2, 9, 10, 16, 21}, meaning that nodes 10 and 11 are interchangeable and equally important; betweenness obtained its maximum score (0.497) with the group {2, 9, 11, 16, 21} (Supplementary data S2). The best group by DF is {2, 9, 11, 14, 16, 21}, which achieves a score of 1. The groups {2, 9, 10, 16, 21} and {2, 9, 11, 16, 21} equally obtained the best DR score (0.792). It is worth noticing that DR and betweenness do not reach their absolute maximum scores, which however are plausibly the highest possible for this network, since groups of bigger sizes exhibit lower scores (Figure 2). It is also interesting to notice that nodes 2, 9 and 21 are included in all groups found above, thereby highlighting their central roles in the network (Supplementary data S3).
Computing the nestedness, which consists in verifying whether sets of nodes recur in groups of increasing sizes, could confirm the importance of nodes 2, 9 and 21. Hence, if larger sets contain smaller sets, higher values of nestedness may be a proxy for identifying upstream/master regulators through the key nodes of the smallest groups. One way to calculate the nestedness of the example network is via the Nrow metric [15,16]. Nrow is defined as the average percentage of nodes from smaller sets that are contained in larger sets, taken over all possible pairs of sets. Thus, after computing all the best sets of increasing sizes, from 2 to 5, for each group centrality metric but m-reach, nodes 2 and 9, and not 21, turned out to be nested in all sets, regardless of their size (Figure 3 and Supplementary data S2). The same evidence emerged with the key-player metrics (Supplementary data S3). The nestedness scores were generally quite high, meaning that nodes are not interchangeable among groups, i.e., there are few equally important nodes. The group {2, 9} is definitely important from a topological point of view, and its discovery would not have been immediately hypothesized without this investigation, since nodes 2 and 9 are five links apart.
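To make this concrete, the sketch below implements the Nrow idea in plain Python exactly as described above (the average fraction of each smaller set contained in the larger set, over all pairs of sets); the formulation in [15,16] may differ in normalization details, so this should be read as an illustration, not as Pyntacle's implementation.

```python
from itertools import combinations

def nrow(sets):
    """Nrow-style nestedness: for every pair of sets, take the fraction of
    the smaller set's nodes that also occur in the larger set, then average
    these fractions over all pairs of sets."""
    fractions = []
    for a, b in combinations(sets, 2):
        small, large = (a, b) if len(a) <= len(b) else (b, a)
        fractions.append(len(small & large) / len(small))
    return sum(fractions) / len(fractions)

# Two of the best sets from the toy example (sizes 2 and 5):
print(nrow([{2, 9}, {2, 9, 11, 16, 21}]))  # {2, 9} is fully nested -> 1.0
```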
This "practical" introduction aims at introducing the theory underlying Pyntacle. A toy-model was used to describe the main features, outline a possible analytical pathway and highlight how Pyntacle may help extract valuable information from real-world networks. The rest of the article will thus present i) the software and its main components, ii) its design and implementation, iii) how to nely use it; iv) benchmarks, assessed on real and simulated networks of increasing sizes, in comparison with a similar software package; v) two real-world case studies.

Pyntacle
Pyntacle is an open-source network analysis framework that was originally designed to tackle Borgatti's Key-Player Problem [10] efficiently through the identification of maximally reachable or disruptive groups of nodes. Contrary to similar tools that either analyze networks with standard global and local topological metrics [17,18], or that provide users with limited tools to detect important groups of nodes [19], Pyntacle adopts optimized heuristic algorithms and parallel computing strategies to make the task of identifying key-player nodes feasible. It has the following attributes: i) available for Windows, Mac, and Linux OS; ii) available as both a command line tool and an API, with an easy and user-friendly interface for both input commands and results visualization; iii) it allows the management of real-world graphs in a computationally efficient way.
Pyntacle is implemented in modules, each designed to analyze a particular aspect of a network. These can calculate global and local topological metrics (metrics module), assess the importance of groups of nodes (groupcentrality and keyplayer modules), search for and analyze clusters of nodes (communities module), perform set operations between networks (set module), generate networks with different topological organizations (e.g. random, scale-free and small-world networks; generate module), and convert and load/save networks using different data formats (e.g. adjacency matrix, edge list, SIF and dot; convert module).

Centrality measures for groups
Pyntacle tackles the problem of identifying key-player nodes that, together, optimally diffuse something through a network or maximally disrupt or fragment a network when removed. It further extends the standard network centrality measures of degree, closeness, and betweenness (refer to [20] for a clear introduction and to [21] for further theoretical explanations) to groups rather than individual elements. In this regard, these methods are a direct generalization of the corresponding individual measures, in the sense that if, e.g., group-degree and degree are applied to groups consisting of single elements, they yield identical results. There are thus two classes of algorithms: one that measures the importance of a set on the basis of its impact on the remaining nodes of a network, and another that does so by considering the sole properties of the elements of a set.
The former class is composed of the DF (KPP-Neg; cf. Eq. 2 in Methods), DR (cf. Eq. 3) and m-reach (KPP-Pos; cf. Eq. 4) algorithms [10]. KPP-Neg measures the fragmentation of a network caused by the removal of a set. KPP-Pos measures the overall cohesion that members of a set have with the remainder of the network. As described in the Methods section, DF measures the degree of reachability of a set of nodes, taking into account also the degree of cohesion of the set. M-reach counts the number of unique nodes reached by any member of a set in m links or less. DR is the weighted proportion of all nodes reached by the set, where nodes are inversely weighted by their minimum distance from the set.
The latter class comprises the group-degree centrality measure, which accounts for the number of non-group nodes that are connected to group members (cf. Eq. 5 in Methods); the group-betweenness centrality measure, which quantifies the proportion of shortest paths connecting pairs of non-group members that pass through the group (cf. Eq. 6); and the group-closeness measure, which sums the distances from the group to all vertices outside the group (cf. Eq. 7).

Search strategies for optimal sets
When the aim is not to quantify the centrality of a specific set of nodes, but rather to discover which is/are the most central set(s) in a network, search heuristics come in handy. In particular, Pyntacle implements the greedy optimization search heuristic presented in [10] and a brute-force combinatorial optimization search strategy (cf. Search algorithms section in Methods). The former progressively replaces the components of a starting random set with all other nodes of a graph, calculating one of the aforementioned centrality metrics for that group, and then stops when a sub-optimal solution is obtained. The latter loops through all possible groups of a predefined size and returns only those exhibiting the best scores for any of the centrality measures. The computational complexity of the heuristic method is clearly much lower than that of the exact method, at the cost of sub-optimal solutions. The brute-force search yields exact solutions, but is computationally impracticable for big networks.

Exploration of crosstalk pathways of sparse real-world networks
Real-world biological networks exhibit hierarchical organizations, where subnetworks (e.g., signaling pathways) are bridged by crosstalk links [22]. A number of developmental processes rely on crosstalk, whose aberrant regulation has been found to be associated with inflammatory response defects as well as cancer and neurodegeneration [23,24]. Together with the observation that causal molecular pathways often coincide with the shortest molecular paths between known disease-associated components (cf. the network parsimony principle [1]), these findings render the study of crosstalk in networks fundamental.
Pyntacle eases the exploration of crosstalk through set operations on graphs. Individual networks can thus be compared (union, intersection and difference) or merged and then studied topologically.
These networks are typically sparse and can be analyzed employing algorithms that work best with graphs with few edges. Pyntacle is optimized to work with increasingly large and complex networks. It provides the user with the possibility to assess the extent of sparseness of a network through mathematical indices, including the compactness and completeness indices [25,26]. In addition, it chooses the best implementation of computationally heavy algorithms at run-time (e.g., the search for all the shortest paths), according to the available hardware (i.e., single or multi-core processors and GPU-enabled graphics cards) and some network global metrics, including the sparseness.
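As a rough illustration of what "sparseness" means here, the classical edge density can be computed directly from the node and edge counts (the compactness and completeness indices cited above are Pyntacle-specific and are not reproduced in this sketch):

```python
def density(n_nodes, n_edges):
    """Edge density of an undirected simple graph: observed edges divided
    by the maximum possible number of edges, n*(n-1)/2. Values close to 0
    indicate a sparse network."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)

# The APID protein-protein interaction network used in the benchmarks
# (3303 vertices, 5561 edges) is extremely sparse:
print(round(density(3303, 5561), 5))  # 0.00102
```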

Data format compatibility and reporting
Pyntacle is compliant with Cytoscape's SIF data format and with the dot network data format. It can input and output adjacency matrices and edge lists as textual files, as well as serialized binary Python objects. Graph, node, or edge attributes can be imported/exported from/to file.
Pyntacle can report any analysis result in two formats: as textual files and as rich HTML files. In particular, the PyntacleInk module outputs an interactive, automatically generated web page that displays the graph, its attributes, and all the results of the analyses that were performed on it.

Implementation
Pyntacle is accessible via the command line and exposes a Python API for fine-tuning its algorithms. It depends on igraph [17] for handling the graph data structure and for borrowing some basic local and global topological measures and network generators.
Heavy computations of new algorithms are just-in-time compiled to native machine instructions by Numba [27] and thus run on multi-process CPU or NVIDIA-compatible GPU hardware, if available in the hosting computing infrastructure (an experimental feature only accessible through the API in version 1.2). Differently from similar packages, this allows Pyntacle to process graphs with thousands of nodes, thus helping it manage, for example, the whole human transcriptome and other networks of comparable sizes. Moreover, GPU acceleration provides high-speed computing of Pyntacle's algorithms, thereby making heavy and long-running tasks feasible.
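The run-time choice between compiled and interpreted code can be sketched with a generic optional-JIT pattern; this is an illustration of the idea, not Pyntacle's actual internals, and the kernel below is a deliberately simple stand-in for a heavy graph routine:

```python
def jit_if_available(func):
    """Return a Numba-compiled version of func when Numba is installed,
    otherwise return the plain-Python function unchanged."""
    try:
        from numba import njit
        return njit(func)
    except ImportError:
        return func

@jit_if_available
def pairwise_products_sum(xs):
    """Stand-in hot loop: sum of products over all pairs of values."""
    total = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            total += xs[i] * xs[j]
    return total

print(pairwise_products_sum((1.0, 2.0, 3.0)))  # 11.0
```

With Numba present, the first call triggers compilation to native code; without it, the same source runs as ordinary Python, which mirrors the hardware-dependent dispatch described above.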
The PyntacleInk visualizer exploits HTML5, Javascript and Sigma to produce an interactive representation of the input graph, its base metrics, and a graphic rendering of the results of most of Pyntacle's algorithms (KPP, group centrality, graph generation, set operations, community detection; Figure 6B). A graph can be displayed using different layouts (i.e., Random, Circular, ForceAtlas, Fruchterman-Reingold), and the canvas renderer allows the visualization of and smooth interaction with graphs up to 5000 nodes in size, using a web browser on a standard desktop PC. All the information about a graph and the analyses that were performed on it is stored in a JSON file; this dictionary is updated with new information whenever a new run of analysis is performed on the same graph, allowing the user to simultaneously explore the results of different algorithms and (through the use of timestamps) the results of the same algorithm run with different parameters over time. Any graphical representation can be exported as vector graphics (SVG) or PNG screenshots.
Finally, Pyntacle is fully compatible with the Jupyter Notebook.

Benchmarks
Compared with the keyplayer 1.0.3 R package [28] and KeyPlayer 1.44 [29], Pyntacle has the following attributes: i) available for Windows, Mac, and Linux OS; ii) available as both a command line tool and an API; iii) it allows the management of real-world graphs in a computationally efficient way.
Wall-clock-time (WCT) comparisons of Pyntacle and keyplayer, when searching for optimal kp-sets of some real and simulated graphs, are shown in Figure 4. Noteworthy is that KeyPlayer is not rigorously testable here, since it is a Windows-only GUI-based application.
Random networks were generated according to the Erdős-Rényi model. Six random networks, three with 100 nodes and three with 1000 nodes, were generated. These two groups of networks differed in their wiring probability, which varied as 0.3, 0.5 and 0.7. This probability is a kind of weighting function, which ranges from 0 to 1, with bigger values producing denser networks. Four real networks were also used: the network representing strong advice-seeking ties in a global consulting company from [10] (32 vertices and 55 edges); the parasite-host food web of the Carpinteria Salt Marsh Reserve (128 vertices and 1198 edges) [15]; the C. elegans connectome (a modified version of the network published in [30]; 279 vertices and 1960 edges); and a high-quality C. elegans protein-protein interaction (PPI) network (3303 vertices and 5561 edges, downloaded from APID [31], the Agile Protein Interactomes DataServer [32]).
WCTs were measured three times for each network and centrality algorithm. DR, m-reach and DF were the only three algorithms in common between the two software packages. The sub-optimal sets of size 2 were determined by both software packages using their own implementations of the greedy optimization search algorithm (cf. Search algorithms in the Methods section). Starting from the 100-node random networks, Pyntacle computed all indices in fractions of a second (or a few seconds for DF), irrespective of the wiring probability. keyplayer computed the same indices on the same networks in 4 to 9 minutes. Considering the 1000-node networks, keyplayer completed the computation of all indices in more than one day, while Pyntacle took a few minutes to 5 hours (DF). Similarly, real networks were analyzed in fractions (or tenths, for DF) of seconds by Pyntacle and in a few seconds to 1 hour by keyplayer, which took more than 1 day to analyze the APID network, as opposed to Pyntacle, which ran for a few minutes to 17 hours. Overall, Pyntacle was 40 to 3,900 times faster than keyplayer, depending on the test.
The brute-force search algorithm yields exact solutions at the cost of an intrinsic combinatorial complexity. However, its computational load can be split across parallel processors. In Pyntacle, the best solutions are obtained after the enumeration of all possible groups of nodes and the calculation of their topological indices. Calculations are in fact independent from each other and hence suitable for parallel execution. When applied to our test networks with the aim of calculating the DR index, we verified that the smaller ones (≤ 100 nodes) benefited from parallel execution only to a limited extent. While the strong advice-seeking ties network exhibited the best speedup with 4 computing cores (1.76X), before decreasing in performance, the net execution time improvement consisted of only 82 msec, on average (Fig. 5C). Similarly, the 100-node random networks, proportionally to the wiring probability, achieved the best speedup values with 16 cores (~8X), with an improvement of just ~4 sec (Fig. 5B). As expected, bigger networks benefited from parallel execution increasingly with the number of nodes. The Carpinteria network achieved the best speedup record (~11X) with 16 cores, although saving just 7 sec of computation, while the connectome peaked at ~25X with 32 cores (Fig. 5A). The computations on the 1000-node random networks scaled well up to 16 cores, exhibiting comparable speedups of 7X, ~6X and ~6X when varying the wiring probability from 0.3 to 0.5 and 0.7, respectively. The bigger APID network exhibited the best performance with 32 cores, achieving a speedup of ~29X and terminating the computation 23 hours earlier than the non-parallel run (Fig. 5D).
Although these are far from linear speedups, the advantage and efficacy of parallel computing strategies is evident for networks of big sizes. These results can be reproduced using a Docker image available from the Pyntacle website.

Case Study 1 - NDUFV1-NDUFV2 residue interaction network
We built a network whose edges linked interacting residues of the two proteins NDUFV1 and NDUFV2, with the aim of identifying key residues at the interface between the two proteins, whose mutations might significantly affect their interaction (Figure 6B). We then computed several local topological metrics for these residues, e.g. degree, betweenness, closeness, radiality and a few others, but none of them correlated appreciably (Supplementary Figure S1) with the contribution provided by each residue to the NDUFV1-NDUFV2 interaction energy (∆∆G, expressed in Kcal/mol and calculated with FoldX [34]; see Methods): a maximum Pearson correlation of 0.32 was observed between ∆∆G and betweenness.
We then applied Pyntacle to the network, searching for the best positive and negative key-player sets of size 2 (colored in blue and red, respectively, in Figure 6; Supplementary data S4). The residues Glu161 and Tyr46 of NDUFV1 were identified as the best negative key-players in the network (according to both the F and DF metrics), namely their removal was estimated to maximally fragment the network and thus potentially hamper the interaction between the two proteins. This was further confirmed by their energetic contributions to the interaction when mutated to Alanine (∆∆G of +0.9 and +3.7 Kcal/mol, for Glu161 and Tyr46, respectively). Moreover, Glu161 of NDUFV1, paired with Leu234 of NDUFV2, and Cys125 of NDUFV1, paired with Tyr46, were identified as the best positive key-player pairs (calculated with the DR and m-reach metrics, respectively), namely they turned out to be immediately reachable from the remaining network by direct links or indirect links joining close neighbor residues.
Contrary to Glu161 and Tyr46, Leu234 and Cys125 neither exhibited a significant ∆∆G when mutated to Alanine, nor were characterized by high values of local topological metrics. For these reasons, they would have been overlooked by all other techniques, even the more accurate and computationally intensive ones, such as Alanine-scanning [35]. Besides being computationally demanding and impracticable for large-scale analyses, Alanine-scanning is known to be blind to residues that are chemically similar to Alanine, thereby ignoring their epistatic features, which are critical in some regions of the interaction interface.
All these issues are overcome with Pyntacle, which allows searching for topologically important groups of residues between proteins efficiently and regardless of their chemical structure.

Case Study 2 - miRNA-miRNA interaction network
microRNAs (miRNAs) are small RNA molecules (18-25 nucleotides) able to regulate gene expression levels through different cellular mechanisms, the most important of which is that a miRNA can recognize different messenger RNAs (mRNAs) as targets and, at the same time, one of those targets can be recognized by multiple miRNAs. Due to the renowned role played by miRNAs in tumorigenesis and cancer progression [36], we analyzed a miRNA interaction network of patients affected by breast cancer.
Expression data of healthy and tumor tissue samples of 87 patients affected by breast cancer were retrieved from The Cancer Genome Atlas (TCGA) and were used to wire 487 miRNAs into two correlation networks, built on healthy and tumor samples, respectively. A functional association between any two miRNAs was assumed to exist if the absolute value of the Pearson's correlation coefficient of their expression levels exceeded 0.5 (cf. Case Study 2 in Methods). The healthy network was very dense and highly connected (average degree of 40.4), as opposed to the tumor network, which was mostly disconnected (average degree of 17.4). Both networks were analyzed with Pyntacle, which identified miR-1307-3p and miR-140-3p as negative key-players, and miR-136-5p, miR-484 and miR-127-5p as positive key-players of the healthy tissue network. These are all associated with the onset and development of breast cancer, and some are even used as markers for prognosis [37,38,39,40,41,42] (Supplementary data S5).
It has to be noted that, while these miRNAs individually have high betweenness values, it would not have been possible to infer a possible synergistic interaction between them without the key-player analysis of Pyntacle.

Discussion & concluding remarks
The motivation behind Pyntacle comes from the constant growth of experimental data sets and from the increasing need to represent and analyze them with computationally efficient methods. This first release of Pyntacle has been designed with these aims, and thus to help researchers from different scientific fields and with different levels of computing skills to approach Network Biology and benefit from its analytical tools.
We showed the attributes of Pyntacle and its versatility in dealing with different problems, all traceable to possibly big networks of interacting elements. In the first case study, Pyntacle was able to identify key amino acids that contribute greatly to the formation of a protein-protein interaction interface. This otherwise time-consuming task was accomplished very efficiently by translating the problem into a network analysis task, compared to current approaches that take tens of minutes to handle even small interaction interfaces. In the second case study, Pyntacle was used to analyze the TCGA data set of miRNA expression in breast cancer to build miRNA-miRNA networks. These were analyzed in search of microRNAs occupying key positions in the network, which were later discovered to be already known biomarkers or responsible for the onset and progression of breast cancer.
In conclusion, Pyntacle represents a starting point for large-scale Network Biology studies. Being a modular framework, it will be expanded to handle weighted networks in the near future, and directed networks immediately after. These features, which are shared with some other key-player detection methods and tools [48,49,28], are relevant to make Pyntacle fully capable of analyzing all kinds of biological networks. It will also be enriched with new optimization search algorithms and with a new algorithm to compute the set nestedness. New applications and use cases are envisaged, the currently most concrete being the analysis of trajectories of Molecular Dynamics simulations of proteins.

Potential implications
The main problem with most of the currently available network analysis tools, which is also the main reason why we made Pyntacle, is that they do not handle networks of medium to big sizes efficiently. This issue regards not only big networks and the suitability and effectiveness of the currently available algorithms to analyze them, but also their practicability over a reasonable time axis and in terms of required computing resources. This point is critical for most fields of research, from Biology to the Medical and Social sciences, where systems are naturally big and complex (e.g., the whole Proteome, the Diseasome and Socialnomics, to mention a few). In particular, a field that at the time of this writing is hitting the headlines, i.e. Epidemiology, is greatly developing in terms of the capability to draw contagion maps and predict infection growth over time. These maps are actually networks where nodes are people and edges are relationships that occurred in recent and short periods of time. There are many ways in which these networks could be studied. One could be to determine the front of the infection, namely a group made of healthy people that are close to the affected people and highly social, and to administer a vaccine to them, in order to curb the infection. Another possibility would be to determine the minimum possible number of communication routes to be closed at a national level in order to implement a clever lockdown. These and several other options may be implemented in Pyntacle, using ad-hoc algorithms and computing protocols tailored for big networks.

Key-Player and group-centrality metrics
Pyntacle tackles the problem of identifying key-player nodes that, together, optimally diffuse something through a network or maximally disrupt or fragment a network when removed. The classes of algorithms are thus two: one that measures the importance of a set on the basis of its impact on the remaining nodes of a network, and another that does so by considering the sole properties of the elements of a set. The former class, also known as KPP-Neg, measures the fragmentation of a network caused by the removal of a set of nodes. It is composed of the F metric:

F = 1 - [ Σ_k s_k (s_k - 1) ] / [ n (n - 1) ]    (Eq. 1)

which is based on the sizes s_k of the components k of the residual graph; and of the DF metric:

DF = 1 - [ 2 Σ_{i<j} (1/d_ij) ] / [ n (n - 1) ]    (Eq. 2)

where d_ij denotes the distance between the i-th and the j-th node (with 1/d_ij = 0 when the two nodes are disconnected). DF ranges from 0, when all nodes are adjacent, to 1, when all nodes are isolates.
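A minimal pure-Python sketch of these two fragmentation scores, computed on a graph given as an adjacency dict (to score a KPP-Neg candidate set, one would call this on the graph after removing the set; function and variable names are ours):

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, src):
    """Unweighted shortest-path distances from src in an adjacency dict."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def fragmentation(adj):
    """F and DF of a graph {node: set(neighbours)}: reachable pairs lower
    F; close pairs (large 1/d_ij) lower DF, matching the definitions above."""
    nodes = list(adj)
    n = len(nodes)
    reachable, weighted = 0, 0.0
    for i, j in combinations(nodes, 2):
        d = bfs_distances(adj, i).get(j)
        if d is not None:
            reachable += 1          # the pair is mutually reachable
            weighted += 1.0 / d     # distance-weighted contribution
    pairs = n * (n - 1) / 2
    return 1 - reachable / pairs, 1 - weighted / pairs

# Path graph 1-2-3: connected (F = 0), but not a clique (DF > 0).
print(fragmentation({1: {2}, 2: {1, 3}, 3: {2}}))
```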
The latter class, also known as KPP-Pos, measures the overall cohesion that members of a set have with the remainder of the network, and comprises the distance-weighted reach:

DR = [ Σ_j (1/d_Kj) ] / n    (Eq. 3)

where n is the size of the graph and d_Kj denotes the minimum distance (shortest path) between any member of the set K and the remaining node j in the graph (with 1/d_Kj taken as 1 when j ∈ K). Similarly, the m-reach metric counts how many unique nodes can be reached from K in m steps or less. The formulation is:

C_K = |{ j ∉ K : d_Kj ≤ m }|    (Eq. 4)

where C_K ranges from 0 to n - k, n representing the size of the graph and k that of the considered set of nodes. It is important to notice that this index assumes that all paths of length m or less are equally important and that all paths longer than m are wholly irrelevant [10].
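A corresponding sketch for the two KPP-Pos scores, based on a multi-source BFS from the candidate set (the convention that members of K contribute 1 to DR is our reading of the stated maximum of 1):

```python
from collections import deque

def distances_from_set(adj, K):
    """Multi-source BFS: minimum number of links from any member of K."""
    dist = {u: 0 for u in K}
    queue = deque(K)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def dR(adj, K):
    """Distance-weighted reach: sum of 1/d_Kj over all reached nodes
    (members contribute 1 each), divided by the graph size n."""
    dist = distances_from_set(adj, K)
    return sum(1.0 / d if d else 1.0 for d in dist.values()) / len(adj)

def m_reach(adj, K, m):
    """Unique non-group nodes reachable from K in m links or less."""
    dist = distances_from_set(adj, K)
    return sum(1 for v, d in dist.items() if v not in K and d <= m)

# Star graph: the hub alone reaches every leaf in one step, so DR = 1.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(dR(star, {0}), m_reach(star, {0}, 1))  # 1.0 3
```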
Pyntacle further extends the standard network centrality measures of degree, closeness, and betweenness to groups of nodes, in such a way that if, e.g., group-degree and degree centrality measures are applied to groups consisting of single elements, they yield identical results. This class of metrics comprises the group-degree centrality measure, which is intended as the number of non-group nodes that are connected to group members. Multiple ties to the same node are counted only once. It is defined as:

C_D(K) = |{ j ∈ V \ K : a_ij = 1 for some i ∈ K }| / (n - k)    (Eq. 5)

where a_ij equals 1 when i and j are adjacent nodes, i ∈ K and j ∈ {V \ K}, and each neighbor j is counted only once, regardless of how many members of K it is adjacent to. Hence, the group-degree centrality ranges from 0, when the group K is completely isolated, to 1, when it is connected to all other nodes. Group-betweenness centrality of a set K is defined as the number of shortest paths connecting any two non-group nodes u and v that pass through K, over the total number of shortest paths between the two:
C_B(K) = Σ_{u<v; u,v ∉ K} p_uv(K) / p_uv    (Eq. 6)

where u and v are any pair of nodes not belonging to the group K, p_uv(K) represents the number of shortest paths connecting u and v that traverse K, and p_uv is the total number of shortest paths between u and v. Group-closeness of a group K is defined as the sum of the minimum (or maximum, or average) distances from the nodes belonging to the group to all other nodes outside the group:
C_C(K) = Σ_{j ∉ K} d̄_Kj    (Eq. 7)

where d̄_Kj is the minimum, maximum or average distance between the nodes in K and node j.
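The three group-centrality measures can be sketched in plain Python over adjacency dicts of unweighted graphs. The group-betweenness computation below counts paths through K as all shortest paths minus the equally short ones that survive K's removal, which is one possible way to realize the definition above (names are ours):

```python
from collections import deque

def bfs_counts(adj, src):
    """Distances and shortest-path counts from src (Brandes-style BFS)."""
    dist, sigma = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def group_degree(adj, K):
    """Fraction of non-group nodes adjacent to at least one member of K
    (each neighbour counted once)."""
    neighbours = set().union(*(adj[u] for u in K)) - set(K)
    return len(neighbours) / (len(adj) - len(K))

def group_closeness(adj, K, how=min):
    """Sum over outside nodes of the min (or max/mean) distance to K."""
    dists = [bfs_counts(adj, u)[0] for u in K]
    return sum(how(d[j] for d in dists if j in d)
               for j in adj if j not in K)

def group_betweenness(adj, K):
    """Sum over non-group pairs (u, v) of the fraction of their shortest
    paths passing through at least one member of K."""
    residual = {u: adj[u] - set(K) for u in adj if u not in K}
    outside = [u for u in adj if u not in K]
    total = 0.0
    for i, u in enumerate(outside):
        dist, sigma = bfs_counts(adj, u)
        rdist, rsigma = bfs_counts(residual, u)
        for v in outside[i + 1:]:
            if v not in dist:
                continue  # u and v are disconnected: no paths at all
            # shortest paths avoiding K: only those of unchanged length
            avoiding = rsigma[v] if rdist.get(v) == dist[v] else 0
            total += (sigma[v] - avoiding) / sigma[v]
    return total

# Path 1-2-3-4-5: node 3 intercepts every path between {1,2} and {4,5}.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(group_degree(path, {3}))       # 0.5
print(group_closeness(path, {3}))    # 1+1+2+2 = 6
print(group_betweenness(path, {3}))  # 4.0
```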

Search algorithms
When the aim is not to quantify the centrality of a specific set of nodes, but to discover which is/are the most central set(s) of nodes in a network, search heuristics come in handy. In particular, Pyntacle implements the greedy optimization search heuristic presented in [10], and a brute-force combinatorial optimization search strategy. The former follows this naïve algorithm:

    K = drawnodes(k, V)
    score = func(K)
    halt = false
    while not halt do
        foreach u ∈ K do
            foreach v ∈ V \ K do
                Ktemp = swap(K, u, v)
                scoretemp = func(Ktemp)
                scoredict[(u, v)] = scoretemp
        u_best, v_best = getPairsWithBestScore(scoredict)
        bestscore = scoredict[(u_best, v_best)]
        if bestscore > score then
            K = swap(K, u_best, v_best)
            score = bestscore
        else
            halt = true
    return (K, score)

func is an appropriate key-player metric; drawnodes is a function that randomly picks k nodes from V, which contains all nodes of the network; swap(K, u, v) substitutes the element u in K with v; getPairsWithBestScore is a function that returns the pair that yielded the best centrality score. This method progressively replaces the components of a starting random set K with all other nodes of a graph, calculating one of the aforementioned centrality metrics for that group, and then stops when a sub-optimal solution is obtained.
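A compact Python rendition of this greedy swap loop, generic over any scoring function (a toy additive score stands in for a key-player metric in the demonstration):

```python
import random

def greedy_search(nodes, func, k, seed=42):
    """Start from a random size-k set and repeatedly apply the single best
    member/non-member swap until no swap improves the score."""
    rng = random.Random(seed)
    K = set(rng.sample(sorted(nodes), k))
    score = func(K)
    while True:
        best_swap, best_score = None, score
        for u in list(K):
            for v in nodes - K:
                candidate = (K - {u}) | {v}
                s = func(candidate)
                if s > best_score:
                    best_swap, best_score = (u, v), s
        if best_swap is None:        # no improving swap: halt
            return K, score
        u, v = best_swap
        K = (K - {u}) | {v}
        score = best_score

# Toy score (sum of node labels): the optimum is clearly {6, 7}.
K, score = greedy_search(set(range(1, 8)), sum, 2)
print(K, score)  # {6, 7} 13
```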
The brute-force combinatorial optimization search strategy implemented in Pyntacle loops through all possible groups of a predefined size (k) and returns only those exhibiting the best scores for any of the previous centrality measures (F). Even if the computation can be performed in parallel (the foreach loop below), the computational complexity of the heuristic method is clearly much lower than that of this method, at the cost of yielding sub-optimal solutions. The algorithm below returns exact solutions, but is computationally impracticable for big networks.

    allsets = generateCombinations(k, V)
    foreach set ∈ allsets do
        scoretemp = func(set)
        scoredict[set] = scoretemp
    bestsets = getSetsWithBestScore(scoredict)
    bestscore = scoredict[bestsets]
    return (bestsets, bestscore)

generateCombinations is a function that generates all possible sets of nodes of size k, picking nodes from V.
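The brute-force counterpart fits in a few lines of Python; since every evaluation in the loop is independent, this is the loop whose iterations can be distributed across cores (again with a toy additive score):

```python
from itertools import combinations

def brute_force_search(nodes, func, k):
    """Score every size-k set and return all best-scoring sets; each call
    to func is independent, hence trivially parallelizable."""
    scores = {frozenset(c): func(set(c))
              for c in combinations(sorted(nodes), k)}
    best = max(scores.values())
    return [set(s) for s, v in scores.items() if v == best], best

# Same toy additive score as the greedy example: the exact optimum is {6, 7}.
best_sets, best = brute_force_search(set(range(1, 8)), sum, 2)
print(best_sets, best)  # [{6, 7}] 13
```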

Set operations on graphs
Graph union, G1 ∪ G2, is implemented as (V1 ∪ V2, E1 ∪ E2), namely as the union of nodes (V) and edges (E). Graph intersection is defined as G1 ∩ G2 = (V1 ∩ V2, E1 ∩ E2), where only common nodes and edges are reported in the resulting graph. The difference between G1 and G2 results in a graph with nodes and edges present in G1 only and not in G2. Since the difference between graphs is not commutative, G1 - G2 ≠ G2 - G1.
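These definitions translate directly into set algebra over (nodes, edges) pairs; a minimal sketch with undirected edges stored as frozensets:

```python
def graph_union(g1, g2):
    """(V1 ∪ V2, E1 ∪ E2) for graphs given as (node-set, edge-set) pairs."""
    (v1, e1), (v2, e2) = g1, g2
    return v1 | v2, e1 | e2

def graph_intersection(g1, g2):
    """(V1 ∩ V2, E1 ∩ E2): only nodes and edges common to both graphs."""
    (v1, e1), (v2, e2) = g1, g2
    return v1 & v2, e1 & e2

def graph_difference(g1, g2):
    """Nodes and edges present in G1 only; note the operation is not
    commutative."""
    (v1, e1), (v2, e2) = g1, g2
    return v1 - v2, e1 - e2

e = lambda a, b: frozenset({a, b})
g1 = ({1, 2, 3}, {e(1, 2), e(2, 3)})
g2 = ({2, 3, 4}, {e(2, 3), e(3, 4)})
print(graph_difference(g1, g2))  # ({1}, {frozenset({1, 2})})
print(graph_difference(g2, g1))  # ({4}, {frozenset({3, 4})}) -> not equal
```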

Case study 1
Chains A and O of the PDB structure 5xtd were considered for the analysis of NDUFV1 and NDUFV2, respectively. First, the structure was repaired (RepairPDB module of FoldX), thus allowing more relaxed residue side-chain rotamers and solving clashes. Then, the residues located at the interaction interface and the interaction energy were determined with the AnalyseComplex module of FoldX. Alanine scanning of the interface residues was performed by substituting each amino acid with an alanine residue (BuildModel module). The interaction energy of each mutant was calculated with the AnalyseComplex module in order to determine the ∆∆G of the mutant compared to the wild-type interaction structure. A residue-residue interaction network was then built, connecting residues belonging to different chains of the complex if any of their non-hydrogen atoms (both backbone and side chain) lay within a 4.5 Å radius of each other. The network was then analyzed with the keyplayer module of Pyntacle, and the greedy search algorithm was used to find the optimal key-player sets of size 2, considering all available metrics (F and DF as KPP-Neg, and DR and m-reach, with m set to 2, as KPP-Pos).
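The distance-threshold rule used to build the residue-residue network can be sketched as below. This is an illustrative reimplementation, not the code used for the paper: the input here is a toy dictionary of coordinates rather than atoms parsed from the 5xtd PDB file, and the function name `contact_network` is ours.

```python
from math import dist  # Python >= 3.8

def contact_network(atoms, cutoff=4.5):
    """Residue-residue interaction network from atomic coordinates (sketch).

    `atoms` maps (chain, residue_id) -> list of non-hydrogen atom
    coordinates (x, y, z). Residues on *different* chains are connected
    when any pair of their atoms lies within `cutoff` angstroms, as in
    Case Study 1.
    """
    residues = list(atoms)
    edges = set()
    for i, r1 in enumerate(residues):
        for r2 in residues[i + 1:]:
            if r1[0] == r2[0]:          # same chain: skip
                continue
            if any(dist(a, b) <= cutoff
                   for a in atoms[r1] for b in atoms[r2]):
                edges.add(frozenset((r1, r2)))
    return edges
```

With real data the `atoms` dictionary would be populated from the ATOM records of the PDB file, filtering out hydrogens; the pairwise scan is quadratic in the number of interface residues, which is tractable at this scale (138 residues).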

Case study 2
Expression levels of 547 miRNAs in 87 healthy and 87 tumor breast samples were retrieved from TCGA. Separately for healthy and tumor individuals, the correlation of expression between every possible pair of miRNAs (149,331 total pairs) was calculated with the Pearson correlation coefficient. Only significant values exceeding ±0.5 were considered to represent edges connecting miRNAs in the healthy and tumor networks. In both networks, the best KPP-Pos and KPP-Neg sets of size 2 were sought using Pyntacle's greedy optimization search, calculating F and DF as KPP-Neg, and DR and m-reach, with m set to 2, as KPP-Pos.

Dear Dr. Zauner,
We have revised our manuscript, replying to all the reviewers' concerns to the best of our ability. In particular:
-We have addressed the problem of reproducibility by submitting the definition files of the networks used for our case studies to GigaDB. One of the curators confirmed that a link to these files will be made available in case of acceptance of the manuscript.
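The co-expression network construction of Case Study 2 can be sketched as follows. This is an illustrative reimplementation, not the analysis code: the toy expression values and the function names are ours, and the significance filtering applied in the study is omitted for brevity (only the ±0.5 threshold is shown).

```python
from math import sqrt
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(expr, threshold=0.5):
    """Edges between miRNAs whose expression correlates beyond ±threshold.

    `expr` maps miRNA name -> list of expression values across samples.
    """
    return {frozenset((a, b))
            for a, b in combinations(sorted(expr), 2)
            if abs(pearson(expr[a], expr[b])) > threshold}
```

Running this separately on the healthy and tumor sample matrices yields the two networks that are then screened for key players.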

Availability of source code and requirements
-We have clarified that the brute-force algorithm, although clearly disadvantageous, can still be useful for case studies encompassing small- to medium-sized networks. It is one way of obtaining exact solutions to the search for optimal sets of nodes. As clarified, Pyntacle makes use of fine- and coarse-grained parallelism. While the latter is exclusively used in the brute-force implementation, the former is implicitly applied to large-scale networks with the aim of easing their analysis.
-We are particularly grateful to this reviewer, who helped a lot in identifying critical performance issues and improving the clarity of our presentation. We have taken this point seriously and answered accordingly.
-We have updated our references, also following the reviewers' suggestions.

Following this cover letter, the reviewers will find a point-by-point answer to their concerns. We hope that they, and you, will appreciate our efforts to make Pyntacle as appealing, catchy and intuitive as possible. All authors have participated in the work and take full public responsibility for the content of the paper.

This is an active research field, as evidenced by the review paper, and although a review of this field is clearly outside the scope of this manuscript (a technical note), at least a brief mention of the relationship of the KPP to the CNDP, and a citation of (for example) the review paper, would be of value to readers and would not leave them with the misleading impression that the KPP has only been studied in a couple of papers and that only the brute-force and greedy heuristic algorithms have been studied to solve it.
As well as this general issue, I think some references to the prior use of the KPP (or related CNDP variants) in biology would be welcome, for example in protein-protein interaction (PPI) networks (Boginski & Commander, 2009) and genetic regulatory networks. For a more general perspective on network methods in drug discovery, the extensive review could be informative. In the concluding remarks (p. 8), where the extension to weighted and directed networks is mentioned, it might be noted that some work exists on the CNDP in directed graphs and on the KPP in weighted graphs (Yang, 2015). Indeed, the R keyplayer package is claimed to work with both directed and weighted networks (An & Liu, 2016).
Although reproducing results from the many algorithms for solving the KPP or CNDP variants (either exactly i.e. to optimality, or to potentially suboptimal solutions with heuristics) is clearly impractical, rough comparisons to other methods could be facilitated by running the benchmarks on the Critical Node Detection Problem Benchmark Instances (Ventresca 2012), available from http://individual.utoronto.ca/mventresca/cnd.html, which are used in many CNDP papers.
We thank the reviewer for the suggested reads; we have integrated them into the revised manuscript. As far as the proposed benchmark is concerned, we think that it would be very informative and useful. However, we prefer to postpone it to the near future for two reasons: 1) as the reviewer has noticed, Pyntacle is equipped with just one heuristic method; we think the suggested effort would be much more effective once Pyntacle runs several more algorithms, which will happen with the next major release, scheduled to come out sometime in the second half of next year. 2) We are aware that the benchmarks presented are preliminary, and we struggled with whether to present them or not. We would prefer to keep them at this preliminary level or, eventually, to remove them completely, if suggested by the reviewers.
A.1.2. Insufficient references to existing software. Although a review of existing network libraries is beyond the scope of the manuscript, it could be useful for readers to at least mention some other general-purpose graph libraries usable from Python, such as graph-tool, igraph, NetworkX, and SNAP. (See Section C below for more.)
Most particularly, it is noted that Pyntacle uses igraph modules, yet the igraph paper is not cited.
In addition, although the Analytic Technologies KeyPlayer program and the R keyplayer package (An & Liu, 2016) are mentioned, the influenceR R package is not. Although that package only solves KPP-Pos with the DF metric, as I show in Section B below (on a single example), it does so much faster than Pyntacle.
We integrated the references in the revised manuscript.
For Case Study 1, I was easily able to reproduce the 3D protein cartoon in Fig. 6 by downloading the 5XTD structure from the RCSB PDB into PyMOL (Schrödinger, LLC), selecting the appropriate two chains, and locating the key player residues as shown. However, to obtain the network of "the 138 residues forming the interaction interface" I would (according to the Methods) have to conduct several operations in FoldX and then extract the network according to a distance threshold described in the Methods. Although there seems to be sufficient detail to reproduce this, it requires registration to use FoldX and some further work to construct the network.
I think it would assist reproducibility (an important consideration in the GigaScience journal) to supply the network as constructed by the authors as a Supplementary Information file (as for example the results of the Pyntacle analysis are already supplied in an Excel file).
Similarly, the network constructed for Case Study 2 could also be made available as Supplementary Information.
We have stored the adjacency matrices of the two case studies in GigaDB. N.b., a member of the GigaScience editorial team confirmed that a link to these files will be made available after the publication of this piece of research. Moreover, we have explained how we built them, in order to improve replicability. In fact, in Case Study 1, the 138 residues identified via FoldX are those considered at the interface and for which energetic contributions were calculated. The network we built was obtained by directly analyzing the PDB file of the complex, connecting residues of different chains and relying only on distance constraints (therefore avoiding external software like FoldX). As we demonstrate in the case study, valuable information is gained nonetheless, without using external software (FoldX) and in much less time.
swarm, ant colony, stochastic gradient descent, etc.) that the user can select from is in our plans.
From my understanding of the manuscript (and some experiments I tried with it, detailed in Section B below), there are two types of parallelism used in Pyntacle. The first is the "fine-grained" parallelism it obtains simply from using the Numba library, and the second is explicit "coarse-grained" parallelism. The first type uses multithreading in lower-level numeric libraries, similar to Microsoft R Open using multithreaded linear algebra routines. The second type is used only for the brute-force algorithm, however (and on Linux seems to work by forking new processes rather than using threads).

The reviewer is almost correct. Multithreading is exploited by Numba in the calculation of all shortest paths (Floyd-Warshall) and only for this task. We do not use any sort of off-the-shelf linear algebra library. Numba allows us to release the global interpreter lock when computing a (not so) restricted set of manipulation tasks on NumPy arrays. With the GIL released, we can spawn threads, handle concurrency and, therefore, parallelize a for-loop of the Floyd-Warshall computation. All metrics based on the calculation of shortest paths therefore benefit from this. As a side note, Pyntacle contains three implementations of the all-pairs shortest paths computation:
-the one borrowed from igraph (sequential);
-one implemented by us using Numba (multithreaded);
-one implemented by us using Numba (GPU).
The brute-force search, on the other hand, relies on the multiprocess Python package and, as correctly noted, forks processes rather than threads. It is used exclusively here because the brute-force search knows the exact number of groups to be computed in advance. Although possible, we have discouraged the mixed use of processes and threads.
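The Floyd-Warshall kernel mentioned above can be sketched in pure Python. This is our illustrative version, not Pyntacle's compiled code; the comments indicate where, under Numba, the loop would be parallelized (decorating with @numba.njit(parallel=True) and using numba.prange), which is the mechanism that releases the GIL and enables the fine-grained multithreading.

```python
INF = float("inf")

def floyd_warshall(adj):
    """All-pairs shortest path distances from an adjacency matrix (sketch).

    `adj` is a square matrix where adj[i][j] is the edge weight between
    i and j (INF if absent, 0 on the diagonal). In a Numba version the
    `for i` loop would become a numba.prange under
    @numba.njit(parallel=True), running it multithreaded.
    """
    n = len(adj)
    d = [row[:] for row in adj]          # work on a copy
    for k in range(n):                   # intermediate node (must stay outer)
        for i in range(n):               # prange(n) in the Numba version
            dik = d[i][k]
            for j in range(n):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d
```

The outer loop over the intermediate node k carries a data dependency and cannot be parallelized, but for a fixed k the updates of the rows i are independent, which is what makes the middle loop a safe parallelization target.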
It is not entirely clear, but it seems to me that Figure 4 shows results for the greedy heuristic with a single thread (no parallelism). Thus it compares the same heuristic algorithm implemented in Pyntacle and in R/keyplayer, showing that Pyntacle is faster (due to implementation details: being in Python rather than R, etc.).
This is a fair comparison if both implementations really use only a single thread. However, I found that Pyntacle uses multithreading (up to the number of cores on the node it runs on) automatically, unless the NUMBA_NUM_THREADS environment variable is set (I am not familiar with the Numba library and had to find this in its online documentation). I could not find any documentation in the Pyntacle online manuals about parallelism or adjusting the number of threads, except for the -T/--threads option, which applies to the brute-force algorithm only (not the greedy heuristic).
The keyplayer R package, however, uses a single thread unless explicitly requested to use parallelism by a function argument.
Hence, I think this should be clarified in the manuscript, and some documentation of how parallelism works, and of how to adjust the number of threads, should be added to the Pyntacle documentation.
Since the command-line interface of Pyntacle was designed for non-experts, we decided to hide one level of parallelism (the fine-grained one) from the user and to automatically manage the parallelism related to the calculation of shortest paths. We used the following simple algorithm, which we will make smarter and adaptive in future releases.

// auto-select the computing mode
if nprocs > 1 then                                   // n.b., nprocs is user-defined
    enable multi-processing and disable multi-threading
else if size(graph) < 250 or rho(graph) < 0.5 then   // rho measures sparseness
    disable multi-processing and disable multi-threading
else
    disable multi-processing and enable multi-threading

N.b., we have replaced -T/--threads with -O/--nprocs, since it actually refers to the coarse-grained parallelism, which forks processes rather than threads.
The number of threads for the fine-grained parallelism is internally set to the number of available CPUs minus 1 by default, and can be tuned by setting the NUMBA_NUM_THREADS environment variable, as correctly pointed out. However, caution is needed here: Numba adjusts the number of active threads on the fly according to the current overheads and, hence, the efficiency of the parallelism. This means that the value specified in the environment variable might not actually be respected. This is the reason why we were not explicit about this aspect: we did not want to cause confusion and a frustrating experience for non-expert users.
We have made this clear in the online documentation.
Multi-threading is, however, controllable via the APIs. To make truly fair comparisons, we benchmarked bare functions with code like this:

...
BruteforceSearch.reachability(
    graph_rnd, 2,
    KpposEnum.mreach, None, m=2,
    cmode=CmodeEnum.cpu,  # this enables multi-threading
    nprocs=3)
...

Mixing multi-processing with multi-threading is possible, although discouraged, since it requires fine-tuning the process/thread ratio to perform well.

Fig. 5, however, shows the speedup obtained by Pyntacle when using the brute-force algorithm with explicit (coarse-grained) parallelism as described, computing the metric on the independent candidate node sets in parallel. This is a valid comparison; however, as per A.1.4 above, I question the practicality of the brute-force method at all, given the availability of many more efficient (albeit more complicated, and perhaps requiring a sophisticated ILP solver, for example) algorithms for exact solution.
As in point A.1.4, we agree that there is a broad spectrum of heuristic search algorithms and optimization techniques better than the one currently implemented in Pyntacle. It is our goal to integrate them in the next releases of Pyntacle. Meanwhile, on GitHub, we have created a branch ("feat#31") where we have already implemented the Stochastic Gradient Descent algorithm. This branch has not yet been merged because the new implementation has to be carefully tested and unit tests still need to be written.
The manuscript mentions graphics processing units (GPUs) for parallelism, but I found no further details on this. When I run Pyntacle on a node with no GPU, I get a warning message that "NVIDIA: no NVIDIA devices found", which would seem to indicate that NVIDIA (CUDA) GPU support is available, and when I run the pyntacle keyplayer module on a cluster node which

Test 1: Key players DF (KPP-Neg) of size k = 2 in the C. elegans connectome (279 nodes), computed with the greedy heuristic.
The network data set used is the CAEEL_Connectome.sif file, as used in "Case Study 3" from the Pyntacle home page. I used the "pyntacle convert" command to convert it from SIF format to edge-list format for use in the R keyplayer package. Note that the influenceR package only does KPP-Pos, so it is not included in this test. This shows Pyntacle computing the KPP-Neg (DF metric) correctly on this test, two orders of magnitude faster than the R keyplayer software. There is no real difference in the time for Pyntacle on 1 core versus 20 cores, which is perhaps unimportant when it takes less than 20 seconds anyway. I am not sure why the keyplayer R package is so much slower with 20 cores than with 1 core, but I suspect it is because the parallel mode in that package is for trying to avoid local minima by running the algorithm with different randomization seeds in parallel (An & Liu 2016, p. 264), rather than for speeding up a single run.
In running these tests I found a bug in Pyntacle-Ink (the HTML graphical output): for the KPP-Neg metric dF, it reports both "Initial Value" and "Value", but they appear to be swapped (they are the right way around in the stdout and in the report TSV file, however).

Amended
Test 2: Key players DR (KPP-Pos) of size k = 2 in the C. elegans PPI network (3303 nodes), computed with the greedy heuristic. (This should be the same test as the benchmark shown in Fig. 4(D) of the manuscript.)
The network data used here is the APID_CAEEL_Level2_maincomponent.adjm adjacency matrix file, as used in "Case Study 2" linked from the Pyntacle software home page, and referred to as "APID" in the manuscript.

2. Perhaps it would be nice if another example, of a protein-protein functional association network from the STRING database, were considered. It would be a biologically motivating example, given the opening discussion about protein interaction networks.

Method
The C. elegans network, collected from APID and used for benchmarking and tests in the manuscript, is built from quality-checked protein-protein interaction data. We have clarified this in the text.
3. Could any heuristics be used to cut down the combinatorial search, especially for larger networks. For instance, could nodes with a degree of 1 be ignored? Can the authors suggest some heuristics to simplify computations for larger networks, based on their observations of what nodes typically figure in the "groups".
The greedy optimization heuristic is shown in the manuscript to identify near-optimal solutions. It is a fast and simple implementation, even if only a starting point. As pointed out by reviewer 1, there is a plethora of faster and more efficient heuristic search algorithms that we aim to include in Pyntacle, so that the user can select among different algorithms. (N.b., as a first reaction to this review, we have started and almost finished implementing another search algorithm, Stochastic Gradient Descent, as can be verified on Pyntacle's GitHub website, branch "feat#31".) We think this is the right path to follow. We do not recommend pruning nodes such as network leaves, as it would greatly influence the outcome of several metrics: for example, the m-reach parameter would yield different results, since the removed nodes could no longer be reached.
4. `pyntacle test` works very well. Will it be possible to have `pyntacle demo` similarly, which can showcase some of the capabilities? Also, can some example networks be included?
In order to gradually introduce the user to Pyntacle, we created a quick-start guide and three case studies, all detailed with data explanations, command lines and graphs, in the form of both HTML pages and Python notebooks. We feel that a Pyntacle "demo" function would not be as welcoming to new users as the guides and case studies. Several examples can be downloaded directly from the quick-start guide and case studies on the website. Moreover, the adjacency matrices associated with the networks analyzed in the two case studies can now be accessed from GigaDB.
5. An example case study with a larger network, such as E. coli from STRING would also add a lot of value to the paper.
The STRING network for C. elegans is one order of magnitude bigger than the E. coli network from the same resource: the STRING C. elegans network counts 18,181 nodes (6,170 when considering only high-quality interactions with a combined score of 900 or higher), while the STRING E. coli K12 W3110/MG1655 networks count 4,210/4,125 nodes respectively (which decrease to 2,698/3,259 nodes when applying the same high-quality filter). The C. elegans