Accurately modeling biased random walks on weighted networks using node2vec+

Abstract

Motivation: Accurately representing biological networks in a low-dimensional space, also known as network embedding, is a critical step in network-based machine learning and is carried out widely using node2vec, an unsupervised method based on biased random walks. However, while many networks, including functional gene interaction networks, are dense, weighted graphs, node2vec is fundamentally limited in its ability to use edge weights during the biased random walk generation process, thus under-using all the information in the network.

Results: Here, we present node2vec+, a natural extension of node2vec that accounts for edge weights when calculating walk biases and reduces to node2vec in the cases of unweighted graphs or unbiased walks. Using two synthetic datasets, we empirically show that node2vec+ is more robust to additive noise than node2vec in weighted graphs. Then, using genome-scale functional gene networks to solve a wide range of gene function and disease prediction tasks, we demonstrate the superior performance of node2vec+ over node2vec in the case of weighted graphs. Notably, due to the limited amount of training data in the gene classification tasks, graph neural networks such as GCN and GraphSAGE are outperformed by both node2vec and node2vec+.

Availability and implementation: The data and code are available on GitHub at https://github.com/krishnanlab/node2vecplus_benchmarks. All additional data underlying this article are available on Zenodo at https://doi.org/10.5281/zenodo.7007164.

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Graphs and networks naturally appear in many real-world datasets, including social networks and biological networks. The graph structure provides insightful information about the role of each node in the graph, such as protein function in a protein-protein interaction network (Krishnan et al., 2016; Liu et al., 2020). To more efficiently and effectively mine information from large-scale graphs with thousands or millions of nodes, several node embedding methods have been developed (Cui et al., 2018; Hamilton et al., 2017). Among them, node2vec has been the top choice in bioinformatics due to its superior performance compared to many other methods (Ata et al., 2021; Yue et al., 2019). However, many biological networks, such as Greene et al. (2015) and Johnson and Krishnan (2022), are dense and weighted by construction, which we demonstrate to be undesirable conditions for node2vec that can lead to sub-optimal performance.
Node2vec (Grover and Leskovec, 2016) is a second-order random walk-based embedding method. It is widely used for unsupervised node embedding for various tasks, particularly in computational biology (Nelson et al., 2019), such as gene function prediction (Liu et al., 2020), disease gene prediction (Ata et al., 2018; Peng et al., 2019) and essential protein prediction (Wang et al., 2021a; Zeng et al., 2021). Some recent works built on top of node2vec aim to adapt node2vec to more specific types of networks (Valentini et al., 2021; Wang et al., 2021b), generalize node2vec to higher dimensions (Hacker, 2021), augment node2vec with additional downstream processing (Chattopadhyay and Ganguly, 2020; Hu et al., 2020), or study node2vec theoretically (Davison and Austern, 2021; Grohe, 2020; Qiu et al., 2018). Nevertheless, none of these follow-up works account for the fact that node2vec is less effective for weighted graphs, where the edge weights reflect the (potentially noisy) similarities between pairs of nodes. This failing is due to the inability of node2vec to differentiate between low- and high-weight edges connecting the previous vertex with a potential next vertex in the random walk, which subsequently causes less accurate modeling of the intended walk bias.
Meanwhile, another line of recent works on graph neural networks (GNNs) has shown remarkable performance in prediction tasks that involve graph structure, including node classification (Bronstein et al., 2021;Wu et al., 2021;Zhang et al., 2021). Although GNNs and embedding methods like node2vec are related in that they both aim at projecting nodes in the graph to a feature space, two main differences set them apart. First, GNNs typically require labeled data, while embedding methods do not. This label dependency makes the embeddings generated by a GNN tied to the quality of the labels, which in some cases, like in biological networks, are noisy and scarce. Second, GNNs typically require node features as input to train, which are not always available. In the absence of given node features, one needs to generate them, and often GNN algorithms use trivial node features such as the constant features or node degree features. These two differences give node embedding methods a unique place in node classification, apart from the GNN methods.
Here, we propose an improved version of node2vec that is more effective for weighted graphs by taking into account the edge weight connecting the previous vertex and the potential next vertex. The proposed method, node2vec+, is a natural extension of node2vec; when the input graph is unweighted, the resulting embeddings of node2vec+ and node2vec are equivalent in expectation. Moreover, when the bias parameters are set to neutral, node2vec+ recovers a first-order random walk, just as node2vec does. Finally, we demonstrate the superior performance of node2vec+ through extensive benchmarking on both synthetic datasets and network-based gene classification datasets using various functional gene interaction networks. Node2vec+ is implemented as part of PecanPy (Liu and Krishnan, 2021) and is available on GitHub: https://github.com/krishnanlab/PecanPy.

Materials and methods
We start by briefly reviewing the node2vec method. Then, we illustrate that node2vec is less effective for weighted graphs due to its inability to identify out edges. Finally, we present a natural extension of node2vec that resolves this issue.

Node2vec overview
In the setting of node embeddings, we are interested in finding a mapping $f : V \to \mathbb{R}^d$ that maps each node $v \in V$ to a $d$-dimensional vector so that the mutual proximity between pairs of nodes in the graph is preserved. In particular, a random walk-based approach aims to maximize the probability of reconstructing the neighborhoods for any node in the graph based on some sampling strategy $S$. Formally, given a graph $G = (V, E)$ (the analysis generalizes to directed and/or weighted graphs), we want to maximize the log probability of reconstructing the sampled neighborhood $N_S(v)$ for each $v \in V$:

$$\max_f \sum_{v \in V} \log \Pr\left(N_S(v) \mid f(v)\right) \quad (1)$$

Under the conditional independence assumption, and the parameterization of the probabilities as the softmax normalized inner products (Grover and Leskovec, 2016; Mikolov et al., 2013b), the objective function above simplifies to:

$$\max_f \sum_{v \in V} \Big[ -\log Z_v + \sum_{u \in N_S(v)} \langle f(u), f(v) \rangle \Big] \quad (2)$$

In practice, the partition function $Z_v = \sum_{v' \in V} \exp\left(\langle f(v), f(v') \rangle\right)$ is approximated by negative sampling (Mikolov et al., 2013a) to save computational time. Given any sampling strategy $S$, Equation (2) can find the corresponding embedding $f$, which is achieved in practice by feeding the generated random walks to the skip-gram model with negative sampling (Mikolov et al., 2013b).
Node2vec devises a second-order random walk as the sampling strategy. Unlike a first-order random walk (Perozzi et al., 2014), where the transition probability of moving to the next vertex $v_n$, denoted $\Pr(v_n \mid v_c)$, depends only on the current vertex $v_c$, a second-order random walk also depends on the previous vertex $v_p$, with transition probability $\Pr(v_n \mid v_c, v_p)$. It does so by applying a bias factor $\alpha_{pq}(v_p, v_n)$ to the edge $(v_c, v_n) \in E$ that connects the current vertex and a potential next vertex. This bias factor is a function of the relation between the previous vertex and the potential next vertex, parameterized by the return parameter $p$ and the in-out parameter $q$. In this way, the random walk can be generated based on the following transition probabilities:

$$\Pr(v_n \mid v_c, v_p) \propto \alpha_{pq}(v_p, v_n)\, w(v_c, v_n), \quad (v_c, v_n) \in E \quad (3)$$

where the bias factor is defined as:

$$\alpha_{pq}(v_p, v_n) = \begin{cases} 1/p & \text{if } v_n = v_p \\ 1 & \text{if } (v_n, v_p) \in E \\ 1/q & \text{otherwise} \end{cases} \quad (4)$$

According to this bias factor, node2vec differentiates three types of edges: (i) the return edge, where the potential next vertex is the previous vertex (Fig. 1a); (ii) the out edge, where the potential next vertex is not connected to the previous vertex (Fig. 1b); and (iii) the in edge, where the potential next vertex is connected to the previous vertex (Fig. 1c). Note that the first-order (or unbiased) random walk can be seen as a special case of the second-order random walk where both the return parameter and the in-out parameter are set to neutral ($p = 1$, $q = 1$).
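As a concrete illustration, the transition rule in Equations (3) and (4) can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation; the graph is assumed to be represented as an adjacency dict mapping each node to a neighbor-to-weight dict.

```python
def node2vec_bias(prev, nxt, adj, p, q):
    """node2vec bias factor alpha_pq for a candidate next vertex `nxt`,
    given the previous vertex `prev` (Eq. 4)."""
    if nxt == prev:          # return edge
        return 1.0 / p
    if nxt in adj[prev]:     # in edge: next vertex connected to previous vertex
        return 1.0
    return 1.0 / q           # out edge: next vertex not connected to previous vertex

def transition_probs(prev, cur, adj, p, q):
    """Second-order transition probabilities Pr(nxt | cur, prev) (Eq. 3):
    bias factor times edge weight, normalized over cur's neighbors."""
    unnorm = {nxt: node2vec_bias(prev, nxt, adj, p, q) * w
              for nxt, w in adj[cur].items()}
    total = sum(unnorm.values())
    return {nxt: v / total for nxt, v in unnorm.items()}
```

For example, on a small unweighted graph with a walk that moved from node 0 to node 2, setting q = 2 halves the unnormalized weight of the out edge towards a node not adjacent to node 0, relative to the in and return edges.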
We now turn our attention to weighted networks, where the edge weights are not necessarily zeros or ones. Consider the case where $v_n$ is connected to $v_p$, but with a small weight (Fig. 1d), i.e. $(v_n, v_p) \in E$ and $0 < w(v_n, v_p) \ll 1$. According to the definition of the bias factor, no matter how small $w(v_n, v_p)$ is, $(v_c, v_n)$ would always be considered an in edge. Since in this case $v_n$ and $v_p$ are barely connected, $(v_c, v_n)$ should in fact be considered an out edge. In the extreme case of a fully connected weighted graph, where $(v, v') \in E$ for all $v, v' \in V$, node2vec completely loses its ability to identify out edges.
Thus, node2vec is less effective for weighted networks due to its inability to identify potential out edges, where the terminal vertex $v_n$ is loosely connected to the previous vertex $v_p$. Next, we propose an extension of node2vec that resolves this issue by taking the edge weight $w(v_n, v_p)$ into account in the bias factor.

Node2vec+
The main idea of extending node2vec is to identify potential out edges $(v_c, v_n) \in E$ coming from $v_p$, where $v_n$ is loosely connected to $v_p$. Intuitively, we could determine the 'looseness' of $(v_c, v_n)$ based on some threshold edge weight. However, given that the distribution of edge weights of any given node in the graph is not known a priori, it is hard to come up with a reasonable threshold value for all networks. Instead, we define the looseness of $(v_c, v_n)$ based on the edge weight statistics of each node $v$.
Formally, we first define $\tilde{w}_\gamma(v, u)$, a normalized version of the edge weight $w(v, u)$, based on the mean $\mu(v)$ and the standard deviation $\sigma(v)$ of the edge weights connecting $v$:

$$\tilde{w}_\gamma(v, u) = \frac{w(v, u)}{\mu(v) + \gamma\, \sigma(v)} \quad (5)$$

In practice, we clip the denominator of $\tilde{w}_\gamma(v, u)$ from below by a small number ($10^{-6}$ by default) to prevent division by zero in some cases when $\gamma$ is set to be negative. Then, we say $v \in V$ is $\gamma$-loosely connected (or simply loosely connected if $\gamma = 0$) to $u \in V$ if $\tilde{w}_\gamma(v, u) < 1$. Intuitively, we would like to treat an edge as being 'not connected' if it is 'small enough'. Finally, an edge $(v, u)$ is $\gamma$-loose if $v$ is $\gamma$-loosely connected to $u$, and otherwise it is $\gamma$-tight. Without loss of generality, we consider the case of $\gamma = 0$ in the subsequent sections to simplify the notion of looseness.
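The normalization in Equation (5) can be sketched as follows, using the same adjacency-dict convention as before. This is an illustrative sketch only (for instance, the choice of population rather than sample standard deviation is our assumption), not the PecanPy implementation.

```python
import statistics

def w_tilde(adj, v, u, gamma=0.0, eps=1e-6):
    """Normalized edge weight w~_gamma(v, u) of Eq. (5): w(v, u) divided by
    mu(v) + gamma * sigma(v), with the denominator clipped from below at eps."""
    weights = list(adj[v].values())
    mu = statistics.mean(weights)
    sigma = statistics.pstdev(weights)  # population std; an illustrative choice
    denom = max(mu + gamma * sigma, eps)
    return adj[v].get(u, 0.0) / denom

def is_loose(adj, v, u, gamma=0.0):
    """Edge (v, u) is gamma-loose if v is gamma-loosely connected to u."""
    return w_tilde(adj, v, u, gamma) < 1.0
```

With gamma = 0, an edge is loose exactly when its weight falls below the mean edge weight of its endpoint; increasing gamma raises that threshold by a multiple of the standard deviation.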
Based on the definition of looseness of edges, and assuming $v_p \neq v_n$, there are four types of $(v_c, v_n)$ edges (see Fig. 1c–f). Following node2vec, we categorize these edge types into in and out edges. Furthermore, to prevent amplification of noisy connections, we add one more edge type, the noisy edge, which is always suppressed.

Out edge
As a direct generalization of node2vec, we consider $(v_c, v_n)$ to be an out edge if $(v_c, v_n)$ is tight and $(v_n, v_p)$ is loose (Fig. 1b and d). The in-out parameter $q$ then modifies the out edge to differentiate 'inward' and 'outward' nodes, subsequently leading to Breadth First Search or Depth First Search like searching strategies (Grover and Leskovec, 2016). Unlike node2vec, however, we further parameterize the bias factor $\alpha$ based on $\tilde{w}_\gamma(v_n, v_p)$. Any choice of monotonic function should work, but we choose linear interpolation in this study for simplicity, leaving more sophisticated interpolation functions, such as sigmoidal functions, as future work. Specifically, for an out edge $(v_c, v_n)$, the bias factor is computed as $\alpha_{\gamma pq}(v_p, v_c, v_n) = \frac{1}{q} + \left(1 - \frac{1}{q}\right) \tilde{w}_\gamma(v_n, v_p)$. Thus, the amount of modification to the out edge depends on the level of looseness of $(v_n, v_p)$. When $w(v_n, v_p) = 0$, or equivalently $(v_n, v_p) \notin E$, the bias factor for $(v_c, v_n)$ is $1/q$, the same as defined in node2vec.

Noisy edge
We consider $(v_c, v_n)$ to be a noisy edge if both $(v_c, v_n)$ and $(v_n, v_p)$ are loose (Fig. 1e). Heuristically, noisy edges are not very informative and thus should be suppressed regardless of the setting of $q$ to prevent amplification of noise. Thus, the bias factor for a noisy edge is set to $\min\{1, 1/q\}$.
Notice that by introducing the noisy-edge term, we create a discontinuity in the bias factor when $0 < \tilde{w}_\gamma(v_n, v_p) < 1$ and $\tilde{w}_\gamma(v_c, v_n)$ switches from greater than one to less than one. We provide an alternative version of node2vec+ in the Supplementary material that continuously extends the out edge term with the noisy edge term. However, we empirically show that this continuous version performs no better than node2vec+. Hence, in the main paper, we stick to the 'discontinuous' but simpler version of node2vec+.

In edge
Finally, we consider $(v_c, v_n)$ to be an in edge if $(v_n, v_p)$ is tight, regardless of $w(v_c, v_n)$ (Fig. 1c and f). The corresponding bias factor is set to neutral, as in node2vec.
Combining the above, the bias factor for node2vec+ is defined as follows:

$$\alpha_{\gamma pq}(v_p, v_c, v_n) = \begin{cases} 1/p & \text{if } v_n = v_p \\ 1 & \text{if } \tilde{w}_\gamma(v_n, v_p) \ge 1 \\ \min\{1, 1/q\} & \text{if } \tilde{w}_\gamma(v_n, v_p) < 1 \text{ and } \tilde{w}_\gamma(v_c, v_n) < 1 \\ \frac{1}{q} + \left(1 - \frac{1}{q}\right) \tilde{w}_\gamma(v_n, v_p) & \text{if } \tilde{w}_\gamma(v_n, v_p) < 1 \text{ and } \tilde{w}_\gamma(v_c, v_n) \ge 1 \end{cases} \quad (6)$$

Note that the last two cases in Equation (6) include the cases where $(v_n, v_p) \notin E$. Based on the biased random walk searching strategy using this bias factor, the embedding can be generated accordingly using Equation (2). One can verify, by checking Equation (6), that this is indeed a natural extension of node2vec in the sense that:

• For an unweighted graph, node2vec+ is equivalent to node2vec.
• When p and q are set to 1, node2vec+ recovers a first-order random walk, just as node2vec does.
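Equation (6) can be put together in a short, self-contained sketch (again illustrative only, using the adjacency-dict convention and the same population-standard-deviation assumption as before; not the PecanPy implementation):

```python
import statistics

def w_tilde(adj, v, u, gamma=0.0, eps=1e-6):
    """Normalized edge weight of Eq. (5), denominator clipped from below at eps."""
    weights = list(adj[v].values())
    denom = max(statistics.mean(weights) + gamma * statistics.pstdev(weights), eps)
    return adj[v].get(u, 0.0) / denom

def node2vecplus_bias(prev, cur, nxt, adj, p, q, gamma=0.0):
    """node2vec+ bias factor alpha_{gamma p q}(prev, cur, nxt) from Eq. (6)."""
    if nxt == prev:                              # return edge
        return 1.0 / p
    if w_tilde(adj, nxt, prev, gamma) >= 1.0:    # in edge: (nxt, prev) tight
        return 1.0
    if w_tilde(adj, cur, nxt, gamma) < 1.0:      # noisy edge: both loose -> suppressed
        return min(1.0, 1.0 / q)
    # out edge: (cur, nxt) tight, (nxt, prev) loose -> linear interpolation
    return 1.0 / q + (1.0 - 1.0 / q) * w_tilde(adj, nxt, prev, gamma)
```

On an unweighted graph every existing edge has $\tilde{w}_\gamma = 1$, so the return, in and out cases reduce exactly to Equation (4), matching the first bullet above.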
Finally, by design, node2vec+ is able to identify potential out edges that would have been missed by node2vec.

Synthetic datasets
We start by demonstrating the ability of node2vec+ to identify potential out edges in weighted graphs using a barbell graph and hierarchical cluster graphs. For simplicity, we fix $\gamma = 0$ for all experiments in this section.

Barbell graph
A barbell graph, denoted $B$, is constructed by connecting two complete graphs of size 20 with a common bridge node (Fig. 2a). All edges in $B$ are weighted 1. There are three types of nodes in $B$: (i) the bridge node; (ii) the peripheral nodes that connect the two modules with the bridge node; and (iii) the interior nodes of the two modules. By changing the in-out parameter q, node2vec can place the peripheral nodes closer to either the bridge node or the interior nodes in the embedding space.
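The construction can be sketched as follows (an illustrative sketch in the adjacency-dict convention used earlier; the choice of exactly one peripheral node per module touching the bridge is our simplifying assumption, made for illustration):

```python
def barbell_graph(module_size=20):
    """Two complete modules of `module_size` nodes each, joined by a bridge node.
    All edges have weight 1; node 2*module_size is the bridge."""
    adj = {}

    def add_edge(u, v, w=1.0):
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    left = list(range(module_size))
    right = list(range(module_size, 2 * module_size))
    bridge = 2 * module_size
    for module in (left, right):
        for i, u in enumerate(module):      # complete graph within each module
            for v in module[i + 1:]:
                add_edge(u, v)
    add_edge(left[0], bridge)    # peripheral node of the left module
    add_edge(right[0], bridge)   # peripheral node of the right module
    return adj
```

The perturbed variant described next can then be obtained by additionally connecting every remaining node pair with a weight-0.1 edge.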
When q is large, node2vec suppresses the out edges, e.g. an edge connecting a peripheral node to the bridge node, coming from an interior node. Consequently, the biased random walks are restricted to the network modules. In this case, the transition from the peripheral nodes to the bridge node becomes less likely compared to a first-order random walk, thus pushing the embeddings of the bridge node and the peripheral nodes away from each other. Conversely, when q is small, the transition between the peripheral nodes and the bridge node is encouraged. In this case, the embeddings of the bridge node and the peripheral nodes are pulled together. To see this, we run node2vec with fixed p = 1 and three different settings of q ∈ {1, 100, 0.01}. Indeed, for q = 100, node2vec tightly clusters interior nodes and pushes the bridge node away from the peripheral nodes, and for q = 0.01, the peripheral nodes are pushed away from the interior nodes (Fig. 2b). Since node2vec and node2vec+ are equivalent when the graph is unweighted (see Section 2), we omit the visualization of node2vec+ embeddings for $B$ in the main paper (see Supplementary material).
Next, we perturb the barbell graph by adding loose edges with edge weights of 0.1, making the graph fully connected. This perturbed barbell graph is denoted $\tilde{B}$. As expected, node2vec fails to make use of the q parameter (Fig. 2c), since none of the edges are identified as out edges. On the other hand, node2vec+ can pick up potential out edges and thus qualitatively recovers the desired outcome (Fig. 2d). Note that both node2vec and node2vec+ produce similar results for $\tilde{B}$ when q = 1. This confirms that node2vec+ and node2vec are equivalent when p and q are set to neutral, corresponding to embedding with unbiased random walks. Finally, when using non-neutral settings of q, node2vec+ is able to suppress some noisy edges, resulting in less scattered embeddings of the interior nodes (Fig. 2d).

Hierarchical CLUSTER graph
We use a modified version of the CLUSTER dataset (Dwivedi et al., 2022) to further demonstrate the advantage node2vec+ gains by identifying potential out edges. Specifically, the hierarchical cluster graph K3L2 contains L = 2 levels (3 including the root level) of clusters, and each parent cluster is associated with K = 3 children clusters (Fig. 3a). There are 30 nodes in each cluster, resulting in a total of 390 nodes. To generate the hierarchical cluster graph, we first generate point clouds via a Gaussian process in a latent space so that the Euclidean distance between two points from two sibling clusters is about twice ($\sqrt{2}$ to be precise) the expected Euclidean distance from one of the two points to a point in the parent cluster, which is set to be 1. The noisiness of the clusters is controlled by the parameter $\sigma$, which is set to 0.1 by default. These data points are then turned into a fully connected weighted graph using an RBF kernel (see Supplementary material). We consider two different tasks (Fig. 3a): (i) cluster classification, identifying the individual cluster identity of each node in the graph, and (ii) level classification, identifying the level to which each cluster corresponds. We split the nodes into 10% training and 90% testing and use a multinomial logistic regression model with L2 regularization for prediction. The evaluation process, including the embedding generation, is repeated 10 times, and the final results are reported as Macro F1 scores.
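The point-cloud-to-graph step can be sketched as below. This is an illustrative sketch: the RBF length scale is an assumed parameter, and the paper's exact kernel settings are given in the Supplementary material.

```python
import math

def rbf_graph(points, length_scale=1.0):
    """Fully connected weighted graph from a point cloud via an RBF kernel:
    w(i, j) = exp(-||x_i - x_j||^2 / (2 * length_scale^2))."""
    adj = {i: {} for i in range(len(points))}
    for i, xi in enumerate(points):
        for j in range(i + 1, len(points)):
            d2 = sum((a - b) ** 2 for a, b in zip(xi, points[j]))
            w = math.exp(-d2 / (2 * length_scale ** 2))
            adj[i][j] = w
            adj[j][i] = w
    return adj
```

Because the kernel is strictly positive, every node pair receives some weight, so the resulting graph is fully connected with nearby points linked by heavy edges and distant points by light ones.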
As shown in Figure 3b, the performance of node2vec is not affected by the q parameter because the graph is fully connected. Meanwhile, node2vec+ achieves significantly better performance than node2vec at large q settings for both tasks, demonstrating the ability of node2vec+ to identify potential out edges and use this information to perform localized biased random walks. Similar results are observed on several other hierarchical cluster graphs, K3L3, K5L1 and K5L2 (see Supplementary material).
On the other hand, one might suspect that the issue with the fully connected graph can be alleviated by sparsifying the graph based on an edge weight threshold. Such an approach is widely adopted as a post-processing step for constructing functional gene interaction networks. Here, we show that even after sparsifying the graph aggressively, node2vec+ still outperforms node2vec. In particular, we sparsify the K3L2 graph using the edge weight threshold 0.45, which is the largest value that keeps the graph connected. We then perform the same evaluation analysis above on this sparsified graph K3L2c45. In this case, node2vec indeed performs significantly better than before the sparsification for both tasks. Nonetheless, node2vec+ achieves even better performance, still out-competing node2vec (Fig. 3c).
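The hard-threshold sparsification described above can be sketched as (illustrative, adjacency-dict convention as before):

```python
def sparsify(adj, threshold):
    """Hard-threshold sparsification: drop every edge whose weight
    falls below `threshold`; nodes are kept even if they lose all edges."""
    return {u: {v: w for v, w in nbrs.items() if w >= threshold}
            for u, nbrs in adj.items()}
```

Note that too high a threshold can disconnect the graph, which is why the largest threshold that keeps K3L2 connected (0.45) is used in the experiment above.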
Finally, we conduct a fine-grained evaluation analysis showing that node2vec+ consistently outperforms node2vec under a wide range of conditions, including edge threshold, train-test ratio and noise level (see Supplementary material).

Real-world datasets
Our primary motivation for developing node2vec+ stems from the fact that many functional gene interaction networks are dense and weighted. To systematically evaluate the ability of node2vec+ to embed such biological networks, we consider various challenging gene classification tasks, including gene function and disease gene predictions. Furthermore, we devise experiments with the previously benchmarked datasets BlogCatalog and Wikipedia (Grover and Leskovec, 2016) and confirm that node2vec+ performs equal to or better than node2vec, depending on whether the network is weighted (see Supplementary material).

Datasets
Human functional gene interaction networks: We consider functional gene interaction networks, a broad class of gene interaction networks that are routinely used to capture functional relationships between genes.
• STRING (Szklarczyk et al., 2021) is an integrative gene interaction network that combines evidence of protein interactions from various sources, such as text-mining and high-throughput experiments.
• HumanBase-global is a tissue-naive version of the HumanBase (Greene et al., 2015) tissue-specific networks (previously known as GIANT), which are constructed by integrating hundreds of thousands of publicly available gene expression studies, protein-protein interactions and protein-DNA interactions via a Bayesian approach, calibrated against high-quality known functional gene interactions.
• HumanBaseTop-global is a sparsified version of HumanBase-global that eliminates all edges below the prior of 0.1.

Multi-label gene classification tasks:
We follow the procedure detailed in Liu et al. (2020) to prepare the multi-label gene classification datasets. More specifically, we prepare two collections of gene classification tasks (each called a gene set collection):
• GOBP: gene function prediction tasks derived from the Biological Process gene sets from The Gene Ontology Consortium (2018).
• DisGeNET: disease gene prediction tasks derived from the disease gene sets from the DisGeNET database (Piñero et al., 2016).
After filtering and cleaning up the raw gene set collections, we end up with ~45 functional gene prediction tasks and ~100 disease gene prediction tasks (Table 1). These gene classification tasks are challenging primarily due to the scarcity of labeled examples, with on average 100 and 200 positive examples per task for GOBP and DisGeNET, respectively, relative to the (order of) tens of thousands of nodes in the networks.
We split the genes into 60% training, 20% validation and 20% testing according to the level at which they have been studied in the literature (based on the number of PubMed publications associated with each gene). In particular, the top 60% most well-studied genes are used for training; the 20% least-studied genes are used for testing, and the rest are used for validation. For GNNs, we report the test scores at the epoch where the best validation score is achieved.
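The study-level split described above can be sketched as follows (an illustrative sketch; the gene IDs and publication counts in the example are hypothetical):

```python
def split_by_study_level(pubmed_counts, train_frac=0.6, test_frac=0.2):
    """Split genes by how well-studied they are: the most-studied 60% go to
    training, the least-studied 20% to testing, and the rest to validation.
    `pubmed_counts` maps each gene ID to its number of associated publications."""
    genes = sorted(pubmed_counts, key=pubmed_counts.get, reverse=True)
    n_train = int(len(genes) * train_frac)
    n_test = int(len(genes) * test_frac)
    train = genes[:n_train]
    test = genes[len(genes) - n_test:]
    valid = genes[n_train:len(genes) - n_test]
    return train, valid, test
```

This split is deliberately harder than a random split: the model is trained on well-characterized genes and evaluated on the least-studied ones, mimicking the real use case of transferring knowledge to understudied genes.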
Alongside node2vec and node2vec+, we include two popular GNNs, GCN (Kipf and Welling, 2016) and GraphSAGE (Hamilton et al., 2017), in our comparison. Both methods have shown exceptional performance on many node classification tasks, but their performance on the gene classification tasks considered here has not been well studied. For GraphSAGE, we use the full-batch training strategy with mean pooling aggregation, following the Open Graph Benchmark (Hu et al., 2021).

Experiment setup
Evaluation metric: Following Liu et al. (2020), we use $\log_2\left(\mathrm{auPRC}/\mathrm{prior}\right)$ as our evaluation metric, which represents the log2 fold change of the average precision over the prior. This metric is more suitable than other commonly used metrics such as AUROC, as it corrects for the class imbalance that is prevalent in these gene classification tasks and emphasizes the correctness of top predictions.
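A minimal pure-Python sketch of this metric follows. The average-precision computation is the standard rank-based formulation, not necessarily the exact routine used in the paper.

```python
import math

def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive example,
    scanning predictions from the highest to the lowest score."""
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank
    return total / sum(y_true)

def log2_auprc_over_prior(y_true, y_score):
    """log2 fold change of average precision over the positive-class prior:
    0 means no better than random guessing; each +1 is a two-fold improvement."""
    prior = sum(y_true) / len(y_true)
    return math.log2(average_precision(y_true, y_score) / prior)
```

Dividing by the prior makes scores comparable across tasks with very different positive-class frequencies, which is exactly the imbalance situation in these gene set collections.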
Tuning GNN parameters: For both GNNs, we train one model for each combination of a network and a gene set collection in an end-to-end fashion. The architectures are fixed to five hidden layers with a hidden dimension of 128. Since the gene interaction networks here do not come with node features, we use constant features for GCN and node degree features for GraphSAGE. We use the Adam optimizer (Kingma and Ba, 2014) to train the GNNs for up to 100,000 epochs. The learning rates are tuned via grid search from $10^{-5}$ to $10^{-1}$ based on validation performance. The optimal learning rates, which result in a decent convergence rate without diverging, are 0.01 for GCN and 0.0005 for GraphSAGE (see Supplementary material).

Experimental results
Tuning $\gamma$ significantly improves performance for dense graphs: The $\gamma$ parameter in node2vec+ (see Section 2.2) controls the threshold for distinguishing in edges and out edges. A small or negative $\gamma$ considers most non-zero edges as out edges. Conversely, a large $\gamma$ identifies fewer out edges. When the input graph is noisy and dense, assigning a larger $\gamma$ (e.g. 1) acts as a stronger denoiser that suppresses spurious out edges. Figure 4 compares the gene classification test performance between $\gamma = 0$ and $\gamma \in \{1, 2\}$ with optimally tuned p, q using the HumanBase-global network. Higher testing scores are achieved by larger $\gamma$ settings, illustrating that, to properly 'denoise' the fully connected weighted graph HumanBase-global, we need to increase the noisy edge threshold. In contrast, the difference in performance across $\gamma$ settings is less pronounced for sparser networks like HumanBaseTop-global and STRING (see Supplementary material).
GNN methods perform worse than node2vec(+): In all settings, node2vec+ significantly outperforms both GNN methods (Fig. 5). Notably, for the STRING network, both node2vec and node2vec+ outperform the two GNNs by a large margin. The sub-optimal GNN performance here illustrates that, despite being powerful neural network architectures that can leverage graph structure, GNNs alone cannot learn effectively given a limited number of labeled examples. In contrast, the embedding processes of node2vec(+) are task agnostic and can be carried out effectively without labels. These results indicate that gene classification tasks based on gene interaction networks are more effectively solved by unsupervised shallow embedding methods than by GNNs.

node2vec+ matches or outperforms node2vec: node2vec+ significantly outperforms node2vec [Wilcoxon paired test (Wilcoxon, 1945), P < 0.05], except for the DisGeNET tasks using the HumanBaseTop-global and STRING networks, in which cases the two methods perform equally well (Fig. 5). The performance differences are especially pronounced when using the fully connected and noisy HumanBase-global network, demonstrating node2vec+'s ability to learn robust node representations in the presence of noise. Nevertheless, when the network is less dense (e.g. HumanBaseTop-global), node2vec+ still performs at least as well as node2vec, indicating that node2vec+ is overall a good replacement for node2vec.

Tissue-specific functional gene classification
A key feature of functional gene interaction networks constructed using gene expression data is their ability to capture biological context specificity, such as the tissue-specificity provided by the HumanBase networks.
Thus, we further demonstrate the use case of node2vec+ using tissue-specific functional gene classification tasks derived from Zitnik and Leskovec (2017). After processing, there are 25 tissue-specific functional gene classification tasks, covering 12 different tissues found in the HumanBase database. We follow a similar experimental setup as above, and for each tissue-specific functional gene classification task, we report the following: (i) matched: the prediction performance using the corresponding tissue-specific network; (ii) other: the average prediction performance using tissue-specific networks other than the corresponding tissue; (iii) global: the prediction performance using the tissue-naive network. Figure 6 shows that node2vec+ outperforms node2vec in most scenarios, especially when using the full HumanBase networks. In particular, node2vec+, using the matched tissue-specific full networks for the given functional gene classification tasks, results in significantly better performance than using other (unrelated) tissue-specific networks, as well as the global (tissue-naive) network. On the contrary, node2vec cannot fully utilize the tissue-specific networks, as indicated by the lack of difference in performance between the matched and global networks.
We observe similar results using another collection of tissue-specific co-expression networks, GTExCoExp, that are generated using a benchmarked co-expression network generation workflow by Johnson and Krishnan (2022) (see Supplementary material).

Discussion and conclusion
In this article, we proposed node2vec+, which improves upon the second-order random walk in node2vec for weighted graphs by taking edge weights into account. Consequently, the corresponding node embeddings are improved whenever the in-out walk bias positively influences the task (i.e. whenever the optimal q setting is not 1).
We showed that node2vec+ better identifies potential out edges on weighted graphs than node2vec using two synthetic datasets, the barbell graph and the hierarchical cluster graphs. Furthermore, evaluations on various challenging gene classification tasks demonstrated that embedding methods like node2vec(+) are superior to GNNs. GNNs learn how to orient the nodes in a low-dimensional space to maximize the separation between nodes of different classes in an end-to-end fashion. The suboptimal GNN performance here highlights their need for a much larger labeled training dataset to fully exploit the expressive power of their architectures. Unfortunately, many real-world biological applications, such as the function or disease gene classification problems here, still lack large amounts of labeled data. For these applications, an unsupervised approach like node2vec(+) may be more suitable, as it arranges the latent space purely based on the underlying graph structure, after which a less data-hungry model, such as logistic regression, can be applied to perform the classifications.

Fig. 4. Comparison of different $\gamma$ settings in node2vec+ using HumanBase-global. Each dot represents the testing performance ($\log_2(\mathrm{auPRC}/\mathrm{prior})$) of a specific gene set, with optimally tuned p and q settings.

Fig. 5. Gene classification tasks using protein-protein interaction networks. Each panel corresponds to a specific protein-protein interaction network (HumanBase-global, HumanBaseTop-global and STRING). Each point in a boxplot represents the final test score for a specific task (gene set) in the gene set collection (GOBP or DisGeNET). Starred (*) pairs indicate that the performance between node2vec and node2vec+ is significantly different (Wilcoxon P < 0.05).
Dense weighted graphs are common in biology, whether directly based on experiments [e.g. genetic interactions (Costanzo et al., 2016)], by construction [e.g. co-expression (Zhang and Horvath, 2005)] or by integrating multiple network data sources (Greene et al., 2015; Szklarczyk et al., 2021). Network embedding has recently found applications in studying co-expression networks, e.g. in the context of evolutionary and cross-species network alignment (Ovens et al., 2021a,b), cancer prognostic gene identification (Choi et al., 2018) and gene functional interaction prediction (Du et al., 2019). These applications, especially the ones that leverage dense weighted graphs, are likely to benefit from using node2vec+.
Sparsification using a hard threshold is a common technique for dealing with fully connected weighted graphs such as co-expression networks (Du et al., 2019; Zhang and Horvath, 2005). However, finding the optimal cut threshold can be quite challenging [usually relying on heuristics (Ovens et al., 2021a)], and such thresholding may change the graph significantly in terms of its spectrum (Spielman and Teng, 2010). Node2vec+, on the other hand, can be seen as a soft thresholding approach that suppresses transitions over noisy edges.
Overall, node2vec+ is a natural extension of node2vec for weighted graphs and has several desirable properties. With its general procedure for biased random walks, node2vec+ can be easily adapted into other methods such as KG2Vec (Wang et al., 2021b) and Het-node2vec (Valentini et al., 2021).

Fig. 6. Tissue-specific functional gene classification performance comparison between node2vec and node2vec+ using HumanBase and HumanBaseTop tissue-specific networks.