LinkPred: a high performance library for link prediction in complex networks

The problem of determining the likelihood of the existence of a link between two nodes in a network is called link prediction. This is made possible thanks to the existence of a topological structure in most real-life networks. In other words, the topologies of networked systems such as the World Wide Web, the Internet, metabolic networks, and human society are far from random, which implies that partial observations of these networks can be used to infer information about undiscovered interactions. Significant research efforts have been invested into the development of link prediction algorithms, and some researchers have made the implementation of their methods available to the research community. These implementations, however, are often written in different languages and use different modalities of interaction with the user, which hinders their effective use. This paper introduces LinkPred, a high-performance parallel and distributed link prediction library that includes the implementation of the major link prediction algorithms available in the literature. The library can handle networks with up to millions of nodes and edges and offers a unified interface that facilitates the use and comparison of link prediction algorithms by researchers as well as practitioners.


INTRODUCTION
The field of complex networks, and more generally that of network science, aims at studying networked systems, that is, systems composed of a large number of interacting components (Albert & Barabási, 2002). Under this umbrella fall many seemingly disparate networks that nevertheless share common underlying topological properties, which constitute a fertile ground for analyzing and ultimately understanding these systems. Networks of interest can be social, biological, informational, or technological. Link prediction is the task of identifying links missing from a network (Lü & Zhou, 2011; Martínez, Berzal & Cubero, 2017; Guimerà & Sales-Pardo, 2009; Al Hasan et al., 2006; Clauset, Moore & Newman, 2008; Cannistraci, Alanis-Lobato & Ravasi, 2013; Daminelli et al., 2015; Wang, Satuluri & Parthasarathy, 2007; Zhang et al., 2020; Beigi, Tang & Liu, 2020; Sajadmanesh et al., 2019; Makarov et al., 2019), a problem with important applications, such as the reconstruction of networks from partial observations (Guimerà & Sales-Pardo, 2009), recommendation of items in online shops [...] non-existing edges through C++-style iterators. Also included are auxiliary data structures such as full and sparse node and edge maps.

The network data structures
The life cycle of a network has two distinct phases. In the pre-assembly phase, nodes and edges can be added to the network, nodes can be accessed, and external labels can be translated to internal IDs and vice versa. However, most functionalities related to accessing edges are not yet available, so the network is practically unusable at this stage. To use the network, it must first be assembled. Once assembled, nodes and edges can no longer be added to or removed from the network. The network is then fully functional and can be passed as an argument to any method that requires one. To build a network, an empty network is first created by calling the default constructor:

    UNetwork<> net;

Most classes in LinkPred manipulate networks through smart pointers for efficient memory management. To create a shared pointer to a UNetwork object:

    auto net = std::make_shared<UNetwork<>>();

Notice that the class UNetwork is a class template, which is instantiated here with the default template arguments. In this default setting, the labels are of type std::string, whereas internal IDs are of type unsigned int, but UNetwork can be instantiated with several other data types if desired. For instance, the labels can be of type unsigned int, which may reduce storage size in some situations.
Adding nodes is achieved by calling the method addNode, which takes the node label as a parameter and returns an std::pair containing, respectively, the node ID and a Boolean that is set to true if the node is newly inserted and to false if the node already exists. The node IDs are guaranteed to be contiguous in 0,...,n − 1, where n is the number of nodes.

    auto res = net.addNode(label);
    auto id = res.first;         // This is the node ID
    bool inserted = res.second;  // Was the node inserted or did it already exist?
The method addEdge is used to create an edge between two nodes specified by their IDs (not their labels):

    net.addEdge(i, j);

The last step in building the network is to assemble it:

    net.assemble();

The method assemble initializes the internal data structures and makes the network ready to be used.
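The two-phase life cycle can be illustrated with a small self-contained sketch. The class TinyNetwork below is hypothetical and is not LinkPred's UNetwork; it only mimics the behavior described above (contiguous IDs, an (ID, inserted) pair returned by addNode, and a network that is frozen by assemble). The sorted edge list used internally is an assumption made for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy class, NOT LinkPred's UNetwork.
class TinyNetwork {
public:
    using NodeID = unsigned int;

    // Returns (ID, inserted). IDs are handed out contiguously: 0, 1, ..., n-1.
    std::pair<NodeID, bool> addNode(const std::string& label) {
        if (assembled_) throw std::logic_error("network is already assembled");
        auto it = ids_.find(label);
        if (it != ids_.end()) return std::make_pair(it->second, false);
        const NodeID id = static_cast<NodeID>(labels_.size());
        ids_.emplace(label, id);
        labels_.push_back(label);
        return std::make_pair(id, true);
    }

    // Edges are specified by node IDs, not labels.
    void addEdge(NodeID i, NodeID j) {
        if (assembled_) throw std::logic_error("network is already assembled");
        assert(i != j && i < labels_.size() && j < labels_.size());
        edges_.emplace_back(std::min(i, j), std::max(i, j));
    }

    // Sorts and deduplicates the edge list, then freezes the network.
    void assemble() {
        std::sort(edges_.begin(), edges_.end());
        edges_.erase(std::unique(edges_.begin(), edges_.end()), edges_.end());
        assembled_ = true;
    }

    // Edge queries are only available after assembly.
    bool isEdge(NodeID i, NodeID j) const {
        requireAssembled();
        return std::binary_search(edges_.begin(), edges_.end(),
                                  std::make_pair(std::min(i, j), std::max(i, j)));
    }

    std::size_t getNbEdges() const {
        requireAssembled();
        return edges_.size();
    }

    std::size_t getNbNodes() const { return labels_.size(); }
    const std::string& getLabel(NodeID i) const { return labels_.at(i); }

private:
    void requireAssembled() const {
        if (!assembled_) throw std::logic_error("network is not assembled yet");
    }

    std::map<std::string, NodeID> ids_;              // label -> ID
    std::vector<std::string> labels_;                // ID -> label
    std::vector<std::pair<NodeID, NodeID>> edges_;   // each edge stored as (min, max)
    bool assembled_ = false;
};
```

In this sketch, calling addNode or addEdge after assemble throws, and calling isEdge before assemble throws as well, which mirrors the separation between the two phases.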
Information on edges can only be accessed after assembling the network. One way to access edges is to iterate over all edges in the network, using the methods edgesBegin() and edgesEnd(). As is the case with nodes, it is possible to access a random sample of edges using rndEdgesBegin and rndEdgesEnd. LinkPred also offers the possibility to iterate over negative links in the same way one iterates over positive edges, using the methods nonEdgesBegin() and nonEdgesEnd():

    std::cout << "Start\tEnd" << std::endl;
    for (auto it = net.nonEdgesBegin(); it != net.nonEdgesEnd(); ++it) {
        std::cout << net.start(*it) << "\t" << net.end(*it) << std::endl;
    }

It is also possible to iterate over a randomly selected sample of negative links using rndNonEdgesBegin and rndNonEdgesEnd.
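To make the notion of negative links concrete, the following minimal sketch enumerates the non-existing edges of a small undirected network. The function listNonEdges and the type alias Edge are hypothetical and unrelated to LinkPred's iterators; the sketch simply materializes the list, whereas an iterator-based design avoids storing it.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;  // stored with first < second

// Lists every unordered node pair (i, j), i < j, that is not in `edges`.
std::vector<Edge> listNonEdges(std::size_t n, const std::set<Edge>& edges) {
    assert(edges.size() <= n * (n - 1) / 2);  // cannot exceed the number of pairs
    std::vector<Edge> nonEdges;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            if (edges.count(Edge(i, j)) == 0) {
                nonEdges.emplace_back(i, j);
            }
        }
    }
    return nonEdges;
}
```

A network with n nodes and m edges has n(n − 1)/2 − m non-existing links, so their number grows quadratically with n in sparse networks. This is why iterating over a random sample of negative links is often preferable on large networks.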
To represent directed networks, LinkPred provides the class DNetwork, whose interface is very similar to that of UNetwork.

Maps
Maps are a useful way to associate data with nodes and edges. Two types of maps are available in LinkPred: node maps (class NodeMap) and edge maps (class EdgeMap), both members of UNetwork. The former assigns data to the network nodes, whereas the latter maps data to edges (see Fig. 2 for an example).
Creating a node map is achieved by calling the method createNodeMap on the network object. This is a template method with the mapped data type as the only template argument; for example, a node map over the network net with data type double is created by instantiating createNodeMap with double.
Both NodeMap and EdgeMap offer the same interface, which is in fact similar to that of std::map. This includes the operator [] and the methods at, begin, end, cbegin and cend. In terms of performance, NodeMap offers constant-time access to mapped values, whereas EdgeMap requires logarithmic-time access (O(log m), m being the number of edges).
If a node map is sparse, that is, has non-default values only on a small subset of the elements, it is better to use a sparse node map. To create a sparse node map:

    auto nodeSMap = net.template createNodeSMap<double>(0.0);

Notice that the method takes as input one parameter that specifies the map's default value (in this case, 0.0). Hence, any node that is not explicitly assigned a value is assumed to have the default value 0.0.
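The default-value semantics of a sparse map can be sketched as follows. The class SparseNodeMap below is a hypothetical illustration built on std::unordered_map, not the data structure used by LinkPred; its constructor arguments and method names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>

// Hypothetical sketch, NOT LinkPred's sparse node map.
template <typename T>
class SparseNodeMap {
public:
    SparseNodeMap(std::size_t nbNodes, T defaultValue)
        : nbNodes_(nbNodes), default_(std::move(defaultValue)) {}

    // Read-only access: never inserts, falls back to the default value.
    const T& at(std::size_t node) const {
        assert(node < nbNodes_);
        auto it = values_.find(node);
        return it == values_.end() ? default_ : it->second;
    }

    // Write access: creates an entry initialized to the default value if needed.
    T& operator[](std::size_t node) {
        assert(node < nbNodes_);
        return values_.emplace(node, default_).first->second;
    }

    // Number of nodes that explicitly store a value.
    std::size_t nbStored() const { return values_.size(); }

private:
    std::size_t nbNodes_;
    T default_;
    std::unordered_map<std::size_t, T> values_;
};
```

Memory usage in this sketch is proportional to the number of explicitly assigned nodes rather than to the size of the network, which is the point of using a sparse map.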

Graph algorithms
To facilitate the implementation of link prediction algorithms, LinkPred comes with a set of graph-algorithmic tools such as efficient implementations of graph traversal, shortest path algorithms, and graph embedding methods.

Graph traversal and shortest paths algorithms
LinkPred provides two classes for graph traversal: BFS, for breadth-first traversal, and DFS, for depth-first traversal. Both inherit from the abstract class GraphTraversal, which declares one virtual method, traverse. This method takes as parameters the source node from which the traversal starts and a reference to a NodeProcessor object, which is in charge of processing nodes sequentially as they are visited.
In addition to graph traversal routines, LinkPred contains an implementation of Dijkstra's algorithm for solving the shortest path problem. To use it, it is first necessary to define a length (or weight) map that specifies the length associated with every edge in the graph. A length map is simply a map over the set of edges, that is, an object of type EdgeMap, which can take integer or double values. The class Dijkstra offers two methods for computing distances:
• The method getShortestPath, which computes and returns the shortest path between two nodes and its length.
• The method getDist, which returns the distance between a source node and all other nodes. The returned value is a node map, where each node is mapped to a pair containing the distance from the source node and the number of edges in the corresponding shortest path.
Both methods run Dijkstra's algorithm, except that getShortestPath stops once the destination node is reached, whereas getDist continues until all reachable nodes are visited.
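As an illustration of the single-source variant, the sketch below runs Dijkstra's algorithm with a binary heap and returns, for every node, the pair (distance, number of edges on the shortest path found). It is a generic textbook implementation under assumed types (Arc, AdjList, DistHops), not LinkPred's Dijkstra class.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Arc {
    std::size_t to;
    double length;  // must be non-negative for Dijkstra's algorithm
};
using AdjList = std::vector<std::vector<Arc>>;
using DistHops = std::pair<double, std::size_t>;  // (distance, number of edges)

// Distances from `src` to every node; unreachable nodes keep distance +infinity.
std::vector<DistHops> dijkstraAll(const AdjList& adj, std::size_t src) {
    assert(src < adj.size());
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<DistHops> best(adj.size(), DistHops(inf, 0u));
    using QItem = std::pair<double, std::size_t>;  // (tentative distance, node)
    std::priority_queue<QItem, std::vector<QItem>, std::greater<QItem>> pq;

    best[src] = DistHops(0.0, 0u);
    pq.emplace(0.0, src);
    while (!pq.empty()) {
        const QItem top = pq.top();
        pq.pop();
        const double d = top.first;
        const std::size_t u = top.second;
        if (d > best[u].first) continue;  // stale queue entry, already improved
        for (const Arc& a : adj[u]) {
            const double nd = d + a.length;
            if (nd < best[a.to].first) {
                best[a.to] = DistHops(nd, best[u].second + 1);
                pq.emplace(nd, a.to);
            }
        }
    }
    return best;
}
```

A point-to-point variant would use the same loop and return as soon as the destination node is popped from the queue, since its distance is final at that moment.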
Computing shortest-path distances in large networks requires not only considerable time but also significant space resources. Consequently, efficient management of memory is necessary to render the task feasible in such situations. The abstract class NetDistCalculator provides an interface for an additional layer over the class Dijkstra which facilitates its use and can serve to manage memory usage. A NetDistCalculator object is associated with a single length map and provides two methods for computing distances:
• getDist(i, j): Computes and returns the distance between the two nodes i and j. The returned value is an std::pair, with the first element being the distance, whereas the second is the number of hops in the shortest path joining the two nodes.
• getDist(i): Computes and returns a node map containing the distances from node i to all other nodes in the network.
LinkPred has two implementations of NetDistCalculator: ESPDistCalculator, an exact shortest path distance calculator which caches distances according to different strategies to balance memory usage and computation, and ASPDistCalculator, an approximate shortest path distance calculator. The approximation used in ASPDistCalculator works as follows. A set L of nodes called landmarks is selected, and the distance from each landmark to all other nodes is pre-computed and stored in memory. The distance between any two nodes i, j is then approximated by:

    d_{ij} \approx \min_{k \in L} \left( d_{ik} + d_{kj} \right)    (1)

The landmarks are passed to the ASPDistCalculator object using the method setLandmarks. Naturally, increasing the number of landmarks yields more precision, albeit at a higher computational and memory cost.
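The landmark idea can be sketched in a few lines. The code below assumes the common landmark estimate, in which the distance between i and j is approximated by the smallest value of d(k, i) + d(k, j) over all landmarks k; it uses unweighted hop distances computed by breadth-first search to stay short. The names LandmarkDist, bfsDist and kUnreachable are hypothetical and do not belong to LinkPred.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<std::size_t>>;  // adjacency lists
const std::size_t kUnreachable = std::numeric_limits<std::size_t>::max();

// Hop distances from `src` to all nodes (breadth-first search).
std::vector<std::size_t> bfsDist(const Graph& g, std::size_t src) {
    std::vector<std::size_t> dist(g.size(), kUnreachable);
    std::queue<std::size_t> q;
    dist[src] = 0;
    q.push(src);
    while (!q.empty()) {
        const std::size_t u = q.front();
        q.pop();
        for (std::size_t v : g[u]) {
            if (dist[v] == kUnreachable) {
                dist[v] = dist[u] + 1;
                q.push(v);
            }
        }
    }
    return dist;
}

// Hypothetical sketch of a landmark-based approximate distance calculator.
class LandmarkDist {
public:
    LandmarkDist(const Graph& g, const std::vector<std::size_t>& landmarks) {
        assert(!landmarks.empty());
        for (std::size_t l : landmarks) table_.push_back(bfsDist(g, l));
    }

    // Upper bound on the true distance: best detour through a landmark.
    std::size_t approx(std::size_t i, std::size_t j) const {
        std::size_t best = kUnreachable;
        for (const std::vector<std::size_t>& d : table_) {
            if (d[i] == kUnreachable || d[j] == kUnreachable) continue;
            best = std::min(best, d[i] + d[j]);
        }
        return best;
    }

private:
    std::vector<std::vector<std::size_t>> table_;  // one distance vector per landmark
};
```

On the path 0-1-2-3-4 with the single landmark 2, this sketch returns the exact value 4 for the pair (0, 4) but overestimates the distance between 0 and 1 as 3; adding node 0 as a second landmark makes that estimate exact, which illustrates the trade-off between precision and memory.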
To provide a uniform interface, all embedding algorithms implemented in LinkPred inherit from the abstract class Encoder, which declares the following methods:
• The method init, which is first called to initialize the internal data structures of the encoder. This is a pure virtual method of the class Encoder and must be implemented by derived classes.
• Once the encoder is initialized, the method encode, also a pure virtual method, is called to perform the embedding. This step typically involves solving an optimization problem, which can be computationally intensive both in terms of memory and CPU usage, especially for very large networks. The dimension of the embedding space can be queried and set using getDim and setDim respectively.
• The node embedding or the node code, which is the vector of coordinates assigned to the node, can be obtained by calling the method getNodeCode. The edge code is by default the concatenation of its two nodes' codes and can be obtained using getEdgeCode. Hence, in the default case, the edge code dimension is double that of a node. Classes that implement the Encoder interface may change this default behavior if desired. The user can query the dimension of the edge code using the method getEdgeCodeDim.
Having a unified interface for encoders allows embedding algorithms to be easily combined with different classifiers and similarity measures to obtain various link prediction methods, as explained in the next sections. It also allows users to use their own embedding algorithms to build and test new link prediction methods.
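The shape of such an interface, including the default edge code obtained by concatenation, can be sketched as follows. ToyEncoder and WalkCountEncoder are hypothetical classes written for this illustration; they are not LinkPred's Encoder, and the toy embedding (counts of walks starting at a node) was chosen only because it is easy to verify by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Graph = std::vector<std::vector<std::size_t>>;  // adjacency lists
using Code = std::vector<double>;

// Hypothetical interface modeled on the description above, NOT LinkPred's Encoder.
class ToyEncoder {
public:
    virtual ~ToyEncoder() = default;
    virtual void init() = 0;
    virtual void encode() = 0;
    virtual Code getNodeCode(std::size_t i) const = 0;

    // Default edge code: concatenation of the two node codes.
    virtual Code getEdgeCode(std::size_t i, std::size_t j) const {
        Code code = getNodeCode(i);
        const Code cj = getNodeCode(j);
        code.insert(code.end(), cj.begin(), cj.end());
        return code;
    }
    virtual std::size_t getEdgeCodeDim() const { return 2 * dim_; }

    std::size_t getDim() const { return dim_; }
    void setDim(std::size_t dim) { dim_ = dim; }

protected:
    std::size_t dim_ = 2;
};

// Toy embedding: coordinate k of node i is the number of walks of
// length k + 1 that start at i.
class WalkCountEncoder : public ToyEncoder {
public:
    explicit WalkCountEncoder(const Graph& g) : g_(g) {}

    void init() override { codes_.assign(g_.size(), Code()); }

    void encode() override {
        std::vector<double> walks(g_.size(), 1.0);  // walks of length 0
        for (std::size_t k = 0; k < dim_; ++k) {
            std::vector<double> next(g_.size(), 0.0);
            for (std::size_t i = 0; i < g_.size(); ++i)
                for (std::size_t j : g_[i]) next[i] += walks[j];
            for (std::size_t i = 0; i < g_.size(); ++i) codes_[i].push_back(next[i]);
            walks = next;
        }
    }

    Code getNodeCode(std::size_t i) const override {
        assert(i < codes_.size());
        return codes_[i];
    }

private:
    const Graph& g_;
    std::vector<Code> codes_;
};
```

Because the edge code is produced by the base class, any classifier or similarity measure written against ToyEncoder works unchanged with a different derived encoder, which is the benefit of the unified interface discussed above.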

Machine learning algorithms
The library contains the implementations of several classifiers and similarity measures that can be combined with graph embedding algorithms (see the previous section) to build a variety of link prediction methods. Available classifiers, most of which are derived from mlpack (Curtin et al., 2013), include logistic regression, feed-forward neural networks, linear support vector machine, and Naive Bayes classifier. All binary classifiers in LinkPred implement the interface Classifier, which provides two important methods: the method learn which trains the classifier on a training set, and the method predict which predicts the output for a given input.
Similar to classifiers, all similarity measures in LinkPred inherit from the abstract class SimMeasure, which defines one method, sim, which computes the similarity between two input vectors. Implemented similarity measures include cosine similarity, dot product similarity, L1, L2 and Lp similarity, and Pearson similarity.
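Two of these measures can be sketched behind a one-method interface as follows. ToySimMeasure, CosineSim and PearsonSim are hypothetical names, not LinkPred classes, and returning 0 for an all-zero vector is an assumed convention.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical interface with a single method, as described above.
class ToySimMeasure {
public:
    virtual ~ToySimMeasure() = default;
    virtual double sim(const Vec& a, const Vec& b) const = 0;
};

inline double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
    return s;
}

class CosineSim : public ToySimMeasure {
public:
    double sim(const Vec& a, const Vec& b) const override {
        const double na = std::sqrt(dot(a, a));
        const double nb = std::sqrt(dot(b, b));
        // Assumed convention: similarity is 0 when either vector is all zeros.
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot(a, b) / (na * nb);
    }
};

class PearsonSim : public ToySimMeasure {
public:
    // Pearson correlation equals the cosine similarity of the mean-centered vectors.
    double sim(const Vec& a, const Vec& b) const override {
        assert(a.size() == b.size() && !a.empty());
        return CosineSim().sim(centered(a), centered(b));
    }

private:
    static Vec centered(Vec v) {
        double mean = 0.0;
        for (double x : v) mean += x;
        mean /= static_cast<double>(v.size());
        for (double& x : v) x -= mean;
        return v;
    }
};
```

Applied to the codes of two nodes produced by an embedding, either measure yields a score for the corresponding node pair.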
LinkPred also supports link prediction algorithms based on graph embedding, where the network is first embedded into a low dimensional vector space, whereby nodes are assigned coordinates in that space while preserving the network's structural properties. These coordinates can be used either to compute the similarity between nodes or as features to train a classifier to discriminate between existing edges (the positive class) and non-existing edges (the negative class) (Goyal & Ferrara, 2018c). LinkPred provides two classes that can be used to build link prediction algorithms based on graph embedding: the class UECLPredictor, which combines an encoder (a graph embedding algorithm) and a classifier, and the class UESMPredictor, which pairs the encoder with a similarity measure as illustrated in Fig. 3.
In addition to algorithms for undirected networks, several adaptations of topological similarity methods to directed networks are available as well. The library offers a unified interface for all link prediction algorithms, simplifying the use and comparison of different prediction methods. The interface is called ULPredictor for predictors in undirected networks and DLPredictor for those in directed networks. Most implemented predictors support shared-memory parallelism, and a large number of them support distributed memory parallelism, allowing LinkPred to take advantage of the power of HPC clusters to handle very large networks.

The predictor interface
As stated above, all link predictors for undirected networks must inherit from the abstract class ULPredictor. It declares three important pure virtual methods that derived classes must implement:
• The method void init(): This method is used to initialize the predictor's state, including any internal data structures.
• The method void learn(): In algorithms that require learning, it is in this method that the model is built. The learning is separated from prediction because, typically, the model is independent of the set of edges to be predicted.
• The method double score(Edge const & e): returns the score of the edge e (usually a non-existing edge).
In addition to these three basic methods, ULPredictor declares the following three virtual methods, which by default use the method score to assign scores to edges, but which can be redefined by derived classes to achieve better performance:
• The method void predict(EdgeRndIt begin, EdgeRndIt end, ScoreRndIt scores): The edges to be predicted are passed to the predictor in the form of a range (begin, end), together with a third parameter (scores) to which the scores are written.
• The method std::pair<NonEdgeIt, NonEdgeIt> predictNeg(ScoreRndIt scores) predicts the score for all negative (non-existing) links in the network. The scores are written into the random output iterator scores. The method returns a pair of iterators begin and end to the range of non-existing links predicted by the method.
• The method std::size_t top(std::size_t k, EdgeRndOutIt eit, ScoreRndIt sit) finds the k negative edges with the top scores. The edges are written to the output iterator eit, whereas the scores are written to sit.
The class ULPredictor offers default implementations for the methods top, predict and predictNeg. Sub-classes may use these implementations or redefine them to achieve better performance.
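The division of labor between the pure virtual method score and the default implementations built on top of it can be sketched as follows. ToyPredictor and CommonNeighbors are hypothetical classes, and their simplified signatures (vectors instead of iterator ranges) are assumptions made for brevity; this is not the ULPredictor interface itself. The scoring rule is the classical common-neighbors count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Graph = std::vector<std::set<std::size_t>>;  // sorted neighbor sets
using Edge = std::pair<std::size_t, std::size_t>;
using ScoredEdge = std::pair<double, Edge>;

// Hypothetical base class modeled on the description above, NOT ULPredictor.
class ToyPredictor {
public:
    explicit ToyPredictor(const Graph& g) : g_(g) {}
    virtual ~ToyPredictor() = default;

    virtual void init() = 0;
    virtual void learn() = 0;
    virtual double score(const Edge& e) const = 0;

    // Default: score the given edges one by one.
    virtual std::vector<double> predict(const std::vector<Edge>& edges) const {
        std::vector<double> scores;
        scores.reserve(edges.size());
        for (const Edge& e : edges) scores.push_back(score(e));
        return scores;
    }

    // Default: score every non-edge and keep the k best, highest score first.
    virtual std::vector<ScoredEdge> top(std::size_t k) const {
        std::vector<ScoredEdge> all;
        for (std::size_t i = 0; i < g_.size(); ++i)
            for (std::size_t j = i + 1; j < g_.size(); ++j)
                if (g_[i].count(j) == 0) all.emplace_back(score(Edge(i, j)), Edge(i, j));
        k = std::min(k, all.size());
        std::partial_sort(all.begin(), all.begin() + static_cast<std::ptrdiff_t>(k), all.end(),
                          [](const ScoredEdge& a, const ScoredEdge& b) { return a.first > b.first; });
        all.resize(k);
        return all;
    }

protected:
    const Graph& g_;
};

// Common-neighbors score: number of nodes adjacent to both end points.
class CommonNeighbors : public ToyPredictor {
public:
    using ToyPredictor::ToyPredictor;
    void init() override {}   // nothing to set up
    void learn() override {}  // no model to train
    double score(const Edge& e) const override {
        assert(e.first < g_.size() && e.second < g_.size());
        std::size_t common = 0;
        for (std::size_t z : g_[e.first]) common += g_[e.second].count(z);
        return static_cast<double>(common);
    }
};
```

The default top in this sketch scores all non-existing links, which is quadratic in the number of nodes. A derived class for a local method could instead enumerate only node pairs at distance two, which is the kind of redefinition the text above alludes to.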
The abstract class DLPredictor plays the same role as ULPredictor but for link predictors in directed networks. It offers the same interface as the latter but with different default template arguments and method implementations.

Performance evaluation
LinkPred offers a set of tools that help to streamline the performance evaluation procedure. This includes data setup functionalities, which can be used to create test data by removing and adding edges to ground truth networks. The library also includes efficient implementations of the most important performance measures used in link prediction literature, including the area under the receiver operating characteristic (ROC) curve, the area under the precision-recall (PR) curve, and top precision. The area under the PR curve can be computed using two integration methods: the trapezoidal rule, which uses a linear interpolation between the PR points, and the more accurate nonlinear interpolation method proposed in Davis & Goadrich (2006). In addition to performance measures implementations, LinkPred contains helper classes, namely PerfEvaluator and PerfEvalExp, that facilitate the comparative evaluation of multiple link prediction algorithms using multiple performance measures.
All performance measures inherit from the abstract class PerfMeasure. The most important method in this class is eval which evaluates the value of the performance measure. The performance measure results are written to an object of type PerfResults passed as a parameter of the method. The class PerfResults is defined as std::map<std::string, double>, which allows the possibility of associating several result values with a single performance measure.
An important class of performance measures is performance curves such as ROC and PR curves. They are represented by the abstract class PerfCurve, which inherits from the class PerfMeasure. The class PerfCurve defines a new virtual method getCurve, which returns the performance curve in the form of an std::vector of points. In the remainder of this section, more details of the performance measures implemented in LinkPred are presented.

Receiver operating characteristic curve (ROC)
One of the most important performance measures used in the field of link prediction is the receiver operating characteristic (ROC) curve, in which the true positive rate (recall) is plotted against the false positive rate. The ROC curve can be computed using the class ROC. Figure 4A shows an example ROC curve obtained using this class.
The default behavior of the ROC performance measure is to compute the positive and negative edge scores and then compute the area under the curve, which may lead to memory issues with large networks. To compute the area under the curve without storing both types of scores, the class ROC offers a method that streams scores without storing them. To enable this method, call setStrmEnabled(bool) on the ROC object. To specify which scores to stream use the method setStrmNeg(bool). By default, the negative scores are streamed, while the positive scores are stored. Passing false to setStrmNeg switches this. In addition to consuming little memory, the streaming method supports distributed processing (in addition to shared memory parallelism), making it suitable for large networks.
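One way to compute the area under the ROC curve while storing only one class of scores relies on the fact that this area equals the probability that a randomly chosen positive edge is scored higher than a randomly chosen negative one, with ties counted as one half. The sketch below stores the sorted positive scores and consumes negative scores one at a time. StreamingAUC is a hypothetical class; whether LinkPred's ROC class uses this exact scheme is not stated in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch, NOT LinkPred's ROC class.
class StreamingAUC {
public:
    // The positive scores are stored and sorted once.
    explicit StreamingAUC(std::vector<double> posScores) : pos_(std::move(posScores)) {
        assert(!pos_.empty());
        std::sort(pos_.begin(), pos_.end());
    }

    // Consume one negative score; it is not stored.
    void addNegative(double s) {
        const auto range = std::equal_range(pos_.begin(), pos_.end(), s);
        const double above = static_cast<double>(pos_.end() - range.second);
        const double ties = static_cast<double>(range.second - range.first);
        wins_ += above + 0.5 * ties;
        ++nbNeg_;
    }

    // AUC = P(positive score > negative score) + 0.5 * P(tie).
    double value() const {
        assert(nbNeg_ > 0);
        return wins_ / (static_cast<double>(pos_.size()) * static_cast<double>(nbNeg_));
    }

private:
    std::vector<double> pos_;
    double wins_ = 0.0;
    std::size_t nbNeg_ = 0;
};
```

Each negative score costs one binary search, and partial sums computed on different subsets of negatives can simply be added, which is why a scheme of this kind lends itself to parallel and distributed processing.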

Precision-recall curve
The precision-recall (PR) curve is also a widely used measure of link prediction algorithms' performance. In this curve, the precision is plotted as a function of the recall. The PR curve can be computed using the class PR. The area under the PR curve can be computed using two integration methods:
• The trapezoidal rule, which assumes a linear interpolation between the PR points.
• The nonlinear interpolation method proposed in Davis & Goadrich (2006).
The second method is more accurate, as linear integration tends to overestimate the area under the curve (Davis & Goadrich, 2006). Furthermore, the implementation of Davis-Goadrich nonlinear interpolation in LinkPred ensures little to no additional cost compared to the trapezoidal method. Figure 4B shows an example PR curve obtained using the class PR.
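The difference between the two integration methods can be seen on a small example. The sketch below follows the interpolation of Davis & Goadrich (2006): between two consecutive PR points, true positives are increased one at a time while false positives grow linearly, and precision is recomputed at each intermediate point. The function areaPR and the struct PRPoint are hypothetical, and taking the precision at recall 0 equal to that of the first point is an assumed convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cumulative counts of true and false positives at one score threshold.
struct PRPoint {
    double tp;
    double fp;
};

inline double precisionOf(double tp, double fp) { return tp / (tp + fp); }

// Area under the PR curve for points sorted by increasing tp (integer counts).
// nonlinear == false: trapezoidal rule between consecutive points.
// nonlinear == true : Davis-Goadrich interpolation, one step per true positive.
double areaPR(const std::vector<PRPoint>& pts, double nbPos, bool nonlinear) {
    assert(!pts.empty() && nbPos > 0.0 && pts[0].tp > 0.0);
    // Assumed convention: precision at recall 0 equals that of the first point.
    double area = (pts[0].tp / nbPos) * precisionOf(pts[0].tp, pts[0].fp);
    for (std::size_t k = 1; k < pts.size(); ++k) {
        const PRPoint& a = pts[k - 1];
        const PRPoint& b = pts[k];
        const double dtp = b.tp - a.tp;
        if (dtp <= 0.0) continue;  // no recall gain, hence no area
        const long steps = nonlinear ? static_cast<long>(dtp) : 1;
        double prevTp = a.tp;
        double prevFp = a.fp;
        for (long s = 1; s <= steps; ++s) {
            const double tp = a.tp + dtp * s / steps;
            const double fp = a.fp + (b.fp - a.fp) * s / steps;
            area += (tp - prevTp) / nbPos * 0.5 *
                    (precisionOf(prevTp, prevFp) + precisionOf(tp, fp));
            prevTp = tp;
            prevFp = fp;
        }
    }
    return area;
}
```

With 4 positives and the two PR points (TP = 1, FP = 0) and (TP = 4, FP = 6), the trapezoidal rule gives 0.775 in this sketch, whereas the nonlinear interpolation gives 23/35 ≈ 0.657, consistent with the observation that linear interpolation overestimates the area.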

General performance curves
LinkPred offers the possibility of calculating general performance curves using the class GCurve. A performance curve is, in general, defined by giving the functions that compute its x and y coordinates. These are passed as parameters, in the form of lambdas, to the constructor of the class GCurve. The associated performance value is the area under the curve computed using the trapezoidal rule (linear interpolation). For example, the ROC curve can be defined as:

    GCurve<> cur(fpr, rec, "ROC");

The first two parameters of the constructor are the lambdas that compute the x coordinate (here the false positive rate, fpr) and the y coordinate (here the recall, rec).
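The idea of a curve defined by two coordinate functions can be sketched independently of the library. In the code below, the lambda signature (a function of the confusion counts at a threshold) is an assumption made for this illustration and may differ from the one expected by GCurve; ToyCurve, Counts, CoordFn, Scored and Point are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Confusion counts at a given score threshold.
struct Counts {
    double tp, fp, fn, tn;
};
// ASSUMED lambda signature; the one used by LinkPred's GCurve may differ.
using CoordFn = std::function<double(const Counts&)>;
using Scored = std::pair<double, bool>;   // (score, is a positive edge)
using Point = std::pair<double, double>;  // (x, y)

// Hypothetical sketch of a general performance curve.
class ToyCurve {
public:
    ToyCurve(CoordFn x, CoordFn y) : x_(std::move(x)), y_(std::move(y)) {}

    // One point before any edge is predicted positive, then one per distinct score.
    std::vector<Point> getCurve(std::vector<Scored> scored) const {
        std::sort(scored.begin(), scored.end(),
                  [](const Scored& a, const Scored& b) { return a.first > b.first; });
        Counts c{0.0, 0.0, 0.0, 0.0};
        for (const Scored& s : scored) (s.second ? c.fn : c.tn) += 1.0;
        assert(c.fn > 0.0 && c.tn > 0.0);  // both classes must be present
        std::vector<Point> curve;
        curve.emplace_back(x_(c), y_(c));
        std::size_t i = 0;
        while (i < scored.size()) {
            const double threshold = scored[i].first;
            for (; i < scored.size() && scored[i].first == threshold; ++i) {
                if (scored[i].second) { c.tp += 1.0; c.fn -= 1.0; }
                else { c.fp += 1.0; c.tn -= 1.0; }
            }
            curve.emplace_back(x_(c), y_(c));
        }
        return curve;
    }

    // Area under the curve by the trapezoidal rule.
    double area(const std::vector<Scored>& scored) const {
        const std::vector<Point> curve = getCurve(scored);
        double a = 0.0;
        for (std::size_t k = 1; k < curve.size(); ++k)
            a += (curve[k].first - curve[k - 1].first) * 0.5 *
                 (curve[k].second + curve[k - 1].second);
        return a;
    }

private:
    CoordFn x_, y_;
};
```

Passing a false-positive-rate lambda, fp / (fp + tn), as x and a recall lambda, tp / (tp + fn), as y turns this sketch into a ROC curve whose area is the usual AUC.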

Top precision
The top precision measure is defined as the ratio of true positives within the top l scored edges, l > 0 being a parameter of the measure (usually l is set to the number of links removed from the network). Top precision is implemented by the class TPR, and since it is not a curve measure, this class inherits directly from PerfMeasure. The class TPR offers two approaches for computing top-precision. The first approach requires computing the score of all negative links, whereas the second approach calls the method top of the predictor. The first approach is, in general, more precise but may require more memory and time. Consequently, the second approach is the performance measure of choice for very large networks.
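The first approach, in which all scores are available, can be sketched as follows. The function topPrecision is hypothetical, and breaking ties in favor of negative links (a pessimistic choice) is an assumption of this sketch, not a documented behavior of the class TPR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Fraction of positive (removed) links among the l highest-scored links.
double topPrecision(const std::vector<double>& posScores,
                    const std::vector<double>& negScores, std::size_t l) {
    assert(l > 0 && l <= posScores.size() + negScores.size());
    using Scored = std::pair<double, bool>;  // (score, is positive)
    std::vector<Scored> all;
    all.reserve(posScores.size() + negScores.size());
    for (double s : posScores) all.emplace_back(s, true);
    for (double s : negScores) all.emplace_back(s, false);
    // Only the first l positions need to be ordered.
    std::partial_sort(all.begin(), all.begin() + static_cast<std::ptrdiff_t>(l), all.end(),
                      [](const Scored& a, const Scored& b) {
                          if (a.first != b.first) return a.first > b.first;
                          return !a.second && b.second;  // assumed: negatives win ties
                      });
    std::size_t hits = 0;
    for (std::size_t i = 0; i < l; ++i) hits += all[i].second ? 1 : 0;
    return static_cast<double>(hits) / static_cast<double>(l);
}
```

The memory cost of this sketch is proportional to the number of scored links, which illustrates why an approach based on a predictor's top method is preferable for very large networks.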

Simplified interface and bindings
The simplified interface provides the essential functionalities available in LinkPred via a small number of easy-to-use classes. These classes are very intuitive and can be used with a minimum learning effort. They are ideal for initial use of the library and exploring its main functionalities. Java and Python bindings for the simplified interface are also available, facilitating the library's use by users who are more comfortable using these languages. The simplified interface contains two main classes: Predictor, which allows computing the scores for an input network using all available link prediction algorithms, and the class Evaluator, which can be used for performance evaluation. Also included are simple structures to store prediction and performance results. These classes are designed in a simple way that allows uniform usage across different programming languages.

EXAMPLE USE CASES
This section describes four main use scenarios of the library. The first use case demonstrates the working of the simplified interface in different languages, which is typical for first-time use of the library or for users who prefer to use the library in Python or Java. The second scenario consists in computing the scores of all non-existing links in a network, which is the typical use case for a practitioner working on networked data. Researchers in link prediction are typically interested in implementing new link prediction algorithms, which is presented as the third use case, and evaluating their performance, which is use case number four.

Predicting missing links
When dealing with networked data, a data scientist may be interested in reconstructing a network from partial observations or predicting future interactions. LinkPred offers two ways to solve such problems, computing the scores of all non-existing links and computing top k edges, which may be more efficient for large networks. This section demonstrates how to perform both tasks.

Implementing a new link prediction algorithm
The first step in implementing a new link prediction algorithm is to inherit from ULPredictor and implement the necessary methods. For a minimal implementation, the three methods init, learn and score must at least be defined. To achieve better performance one may want to redefine the three other methods (top, predict and predictNeg).
Suppose one wants to create a very simple link prediction algorithm that assigns to the pair (i, j) the score κ_i + κ_j, the sum of the degrees of the two nodes. The code of this predictor is written in a file named sdpredictor.hpp. The predictor is then ready to be used with LinkPred classes and methods, including performance evaluation routines. For instance, it is possible to write code that extracts the edges with the top scores.
For performance evaluation, a factory class derived from PEFactory creates the predictors and the performance measures:

    class Factory : public PEFactory<> {
    public:
        // Create predictors
        virtual std::vector<std::shared_ptr<ULPredictor<>>> getPredictors(
                std::shared_ptr<UNetwork<> const> obsNet) {
            std::vector<std::shared_ptr<ULPredictor<>>> prs;
            // Add predictors
            prs.push_back(std::make_shared<URALPredictor<>>(obsNet));
            prs.push_back(std::make_shared<UKABPredictor<>>(obsNet));
            return prs;
        }

        // Create performance measures
        virtual std::vector<std::shared_ptr<PerfMeasure<>>> getPerfMeasures(
                TestData<> const& testData) {
            std::vector<std::shared_ptr<PerfMeasure<>>> pms;
            // Add top-precision
            pms.push_back(std::make_shared<TPR<>>(testData.getNbPos()));
            // Add AUC-ROC
            pms.push_back(std::make_shared<ROC<>>());
            return pms;
        }
    };

More use case examples can be found in the library documentation. These include using other link prediction algorithms, computing the scores of a specific set of edges, and other methods for computing the performance of one or several link prediction algorithms.

EXPERIMENTAL RESULTS
In addition to providing an easy interface to use, create and evaluate link prediction algorithms, LinkPred is designed to handle very large networks, a quality that is essential for most practical applications. To demonstrate the performance of LinkPred, its time performance is compared to that of the R package linkprediction and the Python packages linkpred, NetworkX and scikit-network. To conduct a fair and meaningful comparison, two issues must be resolved. First, these packages do not implement the same set of algorithms, and only a limited number of topological similarity methods are implemented by all five libraries. Accordingly, the Resource Allocation index is chosen as the comparison task, since it is implemented by all five packages and exhibits the same network data access patterns as most local methods. The second issue is that the libraries under consideration offer programming interfaces with different semantics. For instance, scikit-network computes the score for edges given as input, whereas the R package linkprediction and the Python packages linkpred and NetworkX do not require input and instead return the scores of non-existing links. Furthermore, the Python package linkpred returns the scores of only those candidate edges that have a non-zero score. To level the playing field, the comparison consists in computing the scores of all non-existing links, even those with zero scores. All networks used in this experiment are connected due to the restriction imposed by the package linkprediction. A description of these networks is given in Table 4 of the appendix. For the sake of fairness, parallelism is disabled in LinkPred, and all experiments are conducted on a single core of an Intel Core i7-4940MX CPU with 32GB of memory. The time reported in Table 2 is the average execution time over ten runs, excluding the time required to read the network from file.
The time for LinkPred is reported for C++ code and the Java and Python bindings. The results show that LinkPred is typically one to two orders of magnitude faster than the other packages. This, of course, can in part be explained by the interpreted nature of Python and R, but it also highlights the fact that link prediction is a computationally intensive task that is best handled by high-performance software that uses efficient data structures and algorithms. As shown in the table, the Java binding of LinkPred introduces only a small overhead, whereas its Python binding is slower due to more complex data marshaling. Nevertheless, the Python binding is significantly faster than the Python packages and, except for a couple of networks, is also faster than linkprediction. Table 3 shows the time taken by LinkPred to complete different link prediction tasks on various hardware architectures. It shows that the library can handle very large networks in relatively small amounts of time, even when the available computational resources are limited.
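For reference, the Resource Allocation index used as the comparison task assigns to a node pair the sum, over their common neighbors, of the inverse of each common neighbor's degree. The following sketch computes it for all non-existing links of a small network. The function names and the adjacency-set representation are assumptions of this illustration and do not reflect how any of the compared packages implement the index.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Graph = std::vector<std::set<std::size_t>>;  // sorted neighbor sets
using Edge = std::pair<std::size_t, std::size_t>;

// Resource Allocation index: sum of 1/degree(z) over common neighbors z of i and j.
double resourceAllocation(const Graph& g, std::size_t i, std::size_t j) {
    assert(i < g.size() && j < g.size());
    double score = 0.0;
    for (std::size_t z : g[i]) {
        if (g[j].count(z) > 0) score += 1.0 / static_cast<double>(g[z].size());
    }
    return score;
}

// Scores of all non-existing links, including those with zero score.
std::vector<std::pair<Edge, double>> scoreAllNonEdges(const Graph& g) {
    std::vector<std::pair<Edge, double>> out;
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = i + 1; j < g.size(); ++j)
            if (g[i].count(j) == 0)
                out.emplace_back(Edge(i, j), resourceAllocation(g, i, j));
    return out;
}
```

The inner loop only touches the neighborhoods of the two end points, which is the local data access pattern shared by most topological similarity methods.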

CONCLUSION AND FUTURE WORK
LinkPred is a distributed and parallel library for link prediction in complex networks. It contains the implementation of the most important link prediction algorithms found in the literature. The library is designed not only to achieve high performance but also to be easy to use and extensible. The experiments show that the library can handle very large networks with up to millions of nodes and edges and is one to two orders of magnitude faster than existing Python and R packages. LinkPred components interact through clearly defined and easy interfaces, allowing users to plug their own components into the library by implementing these interfaces. In particular, users can integrate their own link prediction algorithms and performance measures seamlessly into the library. This makes LinkPred an ideal tool for practitioners as well as researchers in link prediction. The library can be improved and extended in several ways, such as adding R and Octave/Matlab bindings. Another possibility for improvement is implementing further graph embedding algorithms, particularly those based on deep neural networks. Also important is handling dynamic (time-evolving) networks. Finally, sampling-based methods such as SBM and FBM, although producing good results, are only usable with small networks because they are computationally intensive. Distributed implementations of these algorithms will allow using them in practical situations on large networks.

3,031 6,474
Wiki Talks (Leskovec, Huttenlocher & Kleinberg, 2010b; Leskovec, Huttenlocher & Kleinberg, 2010a): A symmetrized version of the Wikipedia talk network. A node represents a user, and an edge indicates that one user edited the talk page of another user. Data available at https://snap.stanford.edu/data/wiki-Talk.html. Nodes: 2,394,385. Edges: 4,659,565.
World Transport (Guimerà et al., 2005): A worldwide airport network. Nodes represent cities, and edges indicate a flight connecting two cities. The data is available at http://seeslab.info/media/filer_public/63/97/63979ddc-a625-42f9-9d3d-8fdb4d6ce0b0/airports.zip.