Forward backward similarity search in knowledge networks
Introduction
Computing the similarity of two or more objects in an information network is the main focus of a large amount of scientific research and technological development. Friendship recommendation in social networks is one example, but web search, community detection, general link prediction, list augmentation, and dozens of other application areas are all singularly dependent upon some notion of similarity in the underlying networks.
Similarity is multi-faceted; various traits can be used to determine similarity depending on the specific problem domain. Entire fields of research are dedicated to the development of algorithms that effectively and efficiently retrieve objects similar to some query-object, e.g., information retrieval, computer vision, and databases (broadly speaking). Researchers and practitioners understand that network topology plays a critical role in the identification of object similarity [1], [2], [3]. An appreciation of the topological features has led to the development of models of network growth, clustering, prediction, and classification.
Given a query vertex u, what we need is a network similarity metric that finds a target vertex v to be similar if they satisfy the following criteria:
- 1. u is highly connected to v, and
- 2. v is highly connected to u.
A typical approach used to compute personalized search is to measure the similarity between some query node and a set of candidate target nodes (maybe all other nodes). After the similarities of the candidate nodes have been found, the user is typically presented with a top-K list of candidate nodes ordered by their similarity scores.
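A minimal sketch of this ranking step in Python, assuming the similarity scores have already been computed and stored in a dictionary. The function name `top_k` and the score values are hypothetical and serve only to illustrate the top-K selection.

```python
import heapq


def top_k(scores, k, query=None):
    """Return the k highest-scoring candidates as (node, score) pairs.

    `scores` maps each candidate node to its similarity with the query.
    The query node itself is skipped when `query` is given.
    """
    candidates = ((node, s) for node, s in scores.items() if node != query)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Hypothetical similarity scores for a query node "q".
scores = {"q": 1.0, "a": 0.42, "b": 0.17, "c": 0.63, "d": 0.05}
ranking = top_k(scores, k=3, query="q")
```

Any similarity measure that produces one score per candidate can feed this step; only the scoring function changes between methods.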
For example, in citation networks Case et al. previously defined six citation behaviors [4], which we simplify into two categories: a) intra-domain citations and b) cross-domain citations. Intra-domain references often include prior work that is directly related to the referencing paper; they are the type of references that a reader would expect to see in the experimental comparison section of that paper. Cross-domain citations, on the other hand, often represent paradigms, platforms, and data sets that come from a separate, loosely related area. For example, the closely related references of this paper include personalized PageRank [5], SimRank [6], and personalized SALSA [7], while the loosely related references include the DBLP [8] and ArnetMiner [9] datasets and the Spark system [10], among others. A good ranking algorithm should be able to distinguish these two types of citations and give the closely related, intra-domain references a higher score than cross-domain references.
Conventional algorithms do not work well on the citation ranking problem for a variety of reasons. To see why, consider the toy example of the citation ranking problem illustrated in Fig. 1 containing 2 communities denoted by white and grey nodes. The task is to rank the references (out-edges) with respect to the query node G. An ideal result would rank the intra-domain (ingroup) references higher than cross-domain (outgroup) references; even downstream references two or more links away from G should, in some instances, be ranked higher than cross-domain references.
According to the forward and backward similarity criteria described above, we would expect, without loss of generality, certain properties from a ranking on the network in Fig. 1.
- The query vertex is always top-ranked, i.e., G is most similar to G.
- Paper D is ranked the second highest because it is directly referenced by G and because other referenced papers, E and F, reference it as well.
- Papers E and F are tied for third highest because they have a similar topology with respect to the query G; they are ranked behind D because they are not referenced by other papers in the same group.
- Paper H is not ranked with D or E and F because it does not have a large reference reciprocity, i.e., G does not belong to the same community as H.
- Papers A, B and C are tied and are ordered after E and F because they are directly referenced from a highly referenced paper D.
Further down the ideal ranking in this example we expect to find H followed by its referenced papers, I, J and K, further followed by incoming citations from papers L, M and N.
The ranked results of many popular similarity measures, including personalized PageRank (PPR), SimRank, personalized SALSA (pSALSA), Adamic Adar, and the model proposed in this paper called Forward Backward Similarity (FBS), are shown on the right side of Fig. 1. The ranked results clearly show that FBS, which is based on the bi-directional criteria, provides an ordering close to the ideal ordering that we expect. The differences in performance highlight the assumptions and biases inherent in the existing algorithms: 1) PPR considers E, F and H to be the same because their forward-similarities are the same from the perspective of G; 2) SimRank fails to assign correct similarity scores to vertices that are directly connected to the query vertex because of a problem that SimRank has with computing odd-numbered distances; 3) pSALSA is unable to distinguish indirectly connected vertices; and 4) Adamic Adar gives A, B, C the same rank as E, F because they all have vertex D as a single common neighbor.
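The odd-distance limitation of SimRank noted in item 2 can be reproduced with a small sketch. The code below is a naive iterative SimRank written for illustration only; the three-node graph, the decay constant `c = 0.8`, and the iteration count are assumptions and are not taken from Fig. 1.

```python
def simrank(in_nbrs, c=0.8, iters=10):
    """Naive iterative SimRank.

    `in_nbrs` maps every node to the list of its in-neighbors.
    Returns a dict keyed by (a, b) pairs.
    """
    nodes = list(in_nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif not in_nbrs[a] or not in_nbrs[b]:
                    new[(a, b)] = 0.0
                else:
                    # Average similarity over all pairs of in-neighbors.
                    total = sum(sim[(i, j)]
                                for i in in_nbrs[a] for j in in_nbrs[b])
                    new[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim


# Hypothetical path a - b - c with edges in both directions.
in_nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
sim = simrank(in_nbrs)
adjacent_score = sim[("a", "b")]   # a and b are directly connected
two_hop_score = sim[("a", "c")]    # a and c are two hops apart
```

On this path, a and c receive a positive score while the directly connected pair a and b stays at zero, because SimRank only credits pairs that can be reached from a common vertex in the same number of steps.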
In general, the problematic results are due to the different interpretation of node-to-node relationships, i.e., existing methods fail to consider similarity from the perspective of the candidate nodes. Although references, and directed edges in general, are one-way relations, similarity is not. Because of this oversight, the current crop of topological similarity measures may return poor or unintuitive results.
Because network communities are often defined as closely connected or tight-knit groups of nodes, an inherent side effect of two-way similarity search is a greater likelihood of rating two nodes as highly similar if they belong to the same community. We therefore expect that any improvement in network similarity should be reflected in the results of community detection algorithms.
The core idea of the present work is to declare two vertices u and v to be similar if u is highly connected to v and v is highly connected to u.
In the following sections we present a forward backward adaptation of stochastic similarity search algorithms (FBS) that can be “plugged-in” to many existing similarity search systems. The FBS-adaptation creates a dual-perspective similarity score that satisfies the forward and backward criteria introduced above. Next, we show how the forward backward similarity search can be used to improve community analysis and link prediction, and we propose a new task called Wikipedia Category Selection that ties a given Wikipedia page to its most similar top-level category. Finally, we present a qualitative study that compares the top similarity results for five well-known data mining researchers.
Section snippets
Forward backward similarity search
To address the problems presented in the previous section, we propose a bi-directional adaptation to stochastic search algorithms to create a forward backward similarity search (FBS) system.
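As a rough illustration of the bi-directional idea, the sketch below wraps personalized PageRank, used here as one possible stochastic base measure. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the restart probability `alpha = 0.15`, the handling of dangling nodes, the choice of a product as the combination function, and the toy graph. The backward score is taken to be the score of the query u in a search started from the candidate v.

```python
def personalized_pagerank(graph, source, alpha=0.15, iters=100):
    """Random walk with restart from `source` via power iteration.

    `graph` maps every node to the list of nodes it points to.
    Returns {node: score}; the scores sum to 1.
    """
    rank = {n: 0.0 for n in graph}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in graph}
        nxt[source] += alpha  # restart mass
        for n, out in graph.items():
            mass = (1.0 - alpha) * rank[n]
            if out:
                share = mass / len(out)
                for m in out:
                    nxt[m] += share
            else:
                nxt[source] += mass  # assumption: dangling nodes restart
        rank = nxt
    return rank


def forward_backward(graph, u, combine=lambda f, b: f * b):
    """Combine the forward score (u -> v) with the backward score (v -> u)."""
    forward = personalized_pagerank(graph, u)
    scores = {}
    for v in graph:
        backward = personalized_pagerank(graph, v)[u]
        scores[v] = combine(forward[v], backward)
    return scores


# Hypothetical toy graph: q cites a and x; only a cites q back.
graph = {"q": ["a", "x"], "a": ["q"], "x": ["y", "z"], "y": [], "z": []}
forward = personalized_pagerank(graph, "q")
scores = forward_backward(graph, "q")
```

In this toy graph a and x receive identical forward scores from q, but only a links back to q, so the combined score separates them. Running one search per candidate is costly; the sketch favors clarity over efficiency.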
Let denote a graph containing vertices and edges . The similarity score of v given some query vertex u is defined as where is the similarity score of v on graph from the perspective of u, and f represents an arbitrary combination function, such as linear
Experiments
The notion of relatedness or similarity plays a critical part in data mining and machine learning algorithms where models are induced by finding intra-cluster similarity and inter-cluster separability, in the case of clustering algorithms, or by drawing a hyperplane comparing class-instances in the case of classification algorithms.
Because of built-in biases and assumptions, similarity measures may succeed in one task, only to fail in many others. To show the robustness of FBS we performed four
Related work
Local neighborhood similarity measures (see [23] for a comprehensive study) count the number of common neighbors between two vertices weighted by the total number of edges for each vertex [18]. These local measures perform impressively on link prediction or concept similarity tasks [24]. Yet, because local similarity measures only look at the ego networks of the query and target nodes, they will not work if the query and target are separated by more than one hop, even if they are highly
Conclusions
In the present work we argue that network similarity should be considered from the perspective of the target and the source nodes. To that end, we have proposed a dual perspective similarity metric called Forward Backward Similarity (FBS) that calculates network similarity based on the perspective of both the query node and the candidate endpoint. Additionally, FBS can be “plugged-in” to many existing network similarity algorithms, thereby extending its use to many different situations.
Acknowledgements
This work is supported by the Templeton Foundation under grant FP053369-M/O.
References (54)
- et al. Discriminative predicate path mining for fact checking in knowledge graphs. Knowl. Based Syst. (2016)
- et al. Friends and neighbors on the web. Soc. Netw. (2003)
- et al. Ontology-based semantic similarity: a new feature-based approach. Expert Syst. Appl. (2012)
- et al. Combating Web Spam with Trustrank. VLDB, VLDB Endowment, New York (2004)
- et al. Object-Level Ranking. WWW (2005)
- et al. Algorithmic computation and approximation of semantic similarity. WWW (2006)
- et al. A semantic similarity measure for linked data: an information content-based approach. Knowl. Based Syst. (2016)
- et al. How can we investigate citation behavior? a study of reasons for citing literature in communication. JASIST (2000)
- Topic-Sensitive Pagerank. WWW, IW3C2, Geneva (2002)
- et al. Simrank: A Measure of Structural-context Similarity. KDD (2002)
- Fast incremental and personalized pagerank. VLDB
- The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. SPIRE
- Arnetminer: Extraction and Mining of Academic Social Networks. KDD
- Spark: Cluster Computing with Working Sets. HotCloud, USENIX, Berkeley
- Fast Incremental Proximity Search in Large Graphs. Proceedings of the 25th international conference on Machine learning
- Online community detection for large complex networks. PLoS ONE
- Fast Algorithm for Modularity-based Graph Clustering. AAAI
- Analysis of Community Structure in Wikipedia. WWW, IW3C2, Geneva
- Modularity and community structure in networks. Proc. Nat. Acad. Sci.
- Finding community structure in very large networks. Phys. Rev. E
- Scalable Similarity Search for Simrank. SIGMOD
- The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol.
- Supervised Link Prediction Using Multiple Sources. ICDM
- Link prediction in citation networks. JASIST
- Citation Prediction in Heterogeneous Bibliographic Networks. SDM
- Theoretical justification of popular link prediction heuristics. IJCAI
- Transitive Node Similarity for Link Prediction in Social Networks with Positive and Negative Links. RecSys
Cited by (8)
- Two-stage routing with optimized guided search and greedy algorithm on proximity graph. 2021, Knowledge-Based Systems.
- Graph convolutional networks with multi-level coarsening for graph classification. 2020, Knowledge-Based Systems.
  Citation excerpt: Graphs are a kind of non-Euclidean data structure for characterizing a set of objects (i.e., nodes) and their relations (i.e., edges) [1]. In practice, graphs with irregular structures naturally occur in a wide diversity of scenarios, ranging from social networks [2,3], knowledge networks [4,5] to protein networks [6,7]. Many real-world applications involve the analysis of graphs, such as graph classification, node classification, node recommendation, link prediction, node visualization, etc.
- Incremental C-Rank: An effective and efficient ranking algorithm for dynamic Web environments. 2019, Knowledge-Based Systems.
  Citation excerpt: Moreover, our incremental C-Rank can be applied to a variety of portals or platforms as long as the concept of contribution is beneficial. For example, C-Rank can be employed in scientific literature search engines [38–41]. Here, each research paper would correspond to a web page, and each citation in the paper would correspond to a hyperlink in a web page similar to a paper that contributes to another paper via a citation [35–37].
- A semantic-rich similarity measure in heterogeneous information networks. 2018, Knowledge-Based Systems.
  Citation excerpt: This framework included many similarity tools and allowed users to compute semantic similarities. In the article [22], the authors studied the similarity search problem in social and knowledge networks and proposed a dual-perspective similarity metric called forward backward similarity. In this section, we introduce some important concepts related to HINs including network schema, meta-path and meta-structure.
- Exploiting semantic similarity for named entity disambiguation in knowledge graphs. 2018, Expert Systems with Applications.
  Citation excerpt: Moreover, recent work of Meymandpour and Davis (2016) made a survey on state of the art of semantic similarity and its application in terms of LOD, and presents an information content-based approach to compute semantic similarity between entities considering the relative importance of various types of entity features available in LOD. Shi, Yang, and Weninger (2017) proposed a dual perspective similarity metric that calculates the similarity between nodes in the network based on the perspective of both the query node and the candidate endpoint, whose effectiveness has been validated in scenarios such as community analysis and link prediction. These similarity methods are proposed for more general semantic network and focused on entity level resources.
- Personalized graph pattern matching via limited simulation. 2018, Knowledge-Based Systems.
  Citation excerpt: Some adapted versions of graph simulation have been used to find matches for patterns in social network [10]. Graph pattern matching is often based on the similarity of nodes, which is an important notion in a great number of applications about information network [19]. In particular, Milner [20] proposed the notion of k-limited bisimilarity (also known as k-bisimilarity).