Elsevier

Knowledge-Based Systems

Volume 119, 1 March 2017, Pages 20-31
Forward backward similarity search in knowledge networks

https://doi.org/10.1016/j.knosys.2016.11.025

Abstract

Similarity search is a fundamental problem in social and knowledge networks like GitHub, DBLP, Wikipedia, etc. Existing network similarity measures are limited because they only consider similarity from the perspective of the query node. However, due to the complicated topology of real-world networks, ignoring the preferences of target nodes often produces odd or unintuitive results. In this work, we propose a dual perspective similarity metric called Forward Backward Similarity (FBS) that efficiently computes topological similarity from the perspective of both the query node and the candidate nodes. The effectiveness of our method is evaluated by traditional quantitative ranking metrics and large-scale human judgement on four large real-world networks. The proposed method matches human preference and outperforms other similarity search algorithms on community overlap and link prediction. Finally, we demonstrate top-5 rankings for five famous researchers on an academic collaboration network to illustrate how our approach captures semantics more intuitively than other approaches.

Introduction

Computing the similarity of two or more objects in an information network is the main focus of a large amount of scientific research and technological development. Friendship recommendation in social networks is one example, but web search, community detection, general link prediction, list augmentation, and dozens of other application areas all depend upon some notion of similarity in the underlying networks.

Similarity is multi-faceted; various traits can be used to determine similarity depending on the specific problem domain. Entire fields of research are dedicated to the development of algorithms that effectively and efficiently retrieve objects similar to some query-object, e.g., information retrieval, computer vision, and databases (broadly speaking). Researchers and practitioners understand that network topology plays a critical role in the identification of object similarity [1], [2], [3]. An appreciation of the topological features has led to the development of models of network growth, clustering, prediction, and classification.

Given a query vertex u, what we need is a network similarity metric that finds a target vertex v to be similar if they satisfy the following criteria:

  1. u is highly connected to v, and

  2. v is highly connected to u.

A typical approach used to compute personalized search is to measure the similarity between some query node and a set of candidate target nodes (maybe all other nodes). After the similarities of the candidate nodes have been found, the user is typically presented with a top-K list of candidate nodes ordered by their similarity scores.
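
The top-K step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `top_k_similar`, the node labels, and the scores in `toy_scores` are all hypothetical, and any similarity function with the signature `(query, candidate) -> float` could be passed in.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Tuple


def top_k_similar(
    query: Hashable,
    candidates: Iterable[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    k: int,
) -> List[Tuple[Hashable, float]]:
    """Score every candidate against `query` and keep the k best."""
    scored = ((v, similarity(query, v)) for v in candidates)
    # heapq.nlargest returns the k highest-scoring pairs, best first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical precomputed scores, used only to exercise the function.
toy_scores = {("q", "a"): 0.9, ("q", "b"): 0.5, ("q", "c"): 0.2}


def toy_similarity(u: Hashable, v: Hashable) -> float:
    return toy_scores.get((u, v), 0.0)


ranking = top_k_similar("q", ["a", "b", "c"], toy_similarity, k=2)
print(ranking)
```

Using a heap keeps only k items at a time, which matters when the candidate set is "all other nodes" in a large network.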

For example, in citation networks Case et al. previously defined six citation behaviors [4], which we simplify into two categories: a) intra-domain citations and b) cross-domain citations. Intra-domain references often include prior work that is directly related to the referencing paper, and are the type of references that a reader would expect to see included in the experimental comparison section of the referencing paper. On the other hand, cross-domain citations often represent paradigms, platforms, and data sets that come from a separate, loosely-related area. For example, the closely related references of this paper include references to personalized PageRank [5], SimRank [6] and personalized SALSA [7]; while the loosely related references of this paper include references to the DBLP [8] and ArnetMiner [9] datasets, or the reference to the Spark system [10], among others. A good ranking algorithm should be able to distinguish these two types of citations and give the closely related, intra-domain references a higher score than cross-domain references.

Conventional algorithms do not work well on the citation ranking problem for a variety of reasons. To see why, consider the toy example of the citation ranking problem illustrated in Fig. 1 containing 2 communities denoted by white and grey nodes. The task is to rank the references (out-edges) with respect to the query node G. An ideal result would rank the intra-domain (ingroup) references higher than cross-domain (outgroup) references; even downstream references two or more links away from G should, in some instances, be ranked higher than cross-domain references.

According to the forward and backward similarity criteria described above, we would expect, without loss of generality, certain properties from a ranking on the network in Fig. 1.

  • The query vertex is always top-ranked, i.e., G is most similar to G because G=G.

  • Paper D is ranked the second highest because it is directly referenced by G and because other referenced papers, E and F, reference it as well.

  • Papers E and F are tied for third highest because they have a similar topology with respect to the query G; they are ranked behind D because they are not referenced by other papers in the same group.

  • Paper H is not ranked with D or E and F because it does not have a large reference reciprocity, i.e., G does not belong to the same community as H.

  • Papers A, B and C are tied and are ordered after E and F because they are directly referenced from a highly referenced paper D.

Further down the ideal ranking in this example we expect to find H followed by its referenced papers, I, J and K, further followed by incoming citations from papers L, M and N.

The ranked results of many popular similarity measures, including personalized PageRank (PPR), SimRank, personalized SALSA (pSALSA), Adamic Adar, and the model proposed in this paper called Forward Backward Similarity (FBS), are shown on the right side of Fig. 1. The ranked results clearly show that FBS, which is based on the bi-directional criteria, provides an ordering close to the ideal ordering that we expect. The differences in performance highlight the assumptions and biases inherent in the existing algorithms: 1) PPR considers E, F and H to be the same because their forward-similarities are the same from the perspective of G; 2) SimRank fails to assign correct similarity scores to vertices that are directly connected to the query vertex because of a problem that SimRank has with computing odd-numbered distances; 3) pSALSA is unable to distinguish indirectly connected vertices; and 4) Adamic Adar gives A, B, C the same rank as E, F because they all have vertex D as a single common neighbor.

In general, the problematic results are due to a different interpretation of node-to-node relationships, i.e., existing methods fail to consider similarity from the perspective of the candidate nodes. Although references, and directed edges in general, are one-way relations, similarity is not. Because of this oversight, the current crop of topological similarity measures may return poor or unintuitive results.

Because network communities are often defined as closely connected or tight-knit groups of nodes, an inherent side effect of two-way similarity search is a greater likelihood of rating two nodes as highly similar if they belong to the same community. So we expect that any improvement in network similarity should be reflected in the results of community detection algorithms.

The core idea of the present work is to declare two vertices u and v to be similar if u is highly connected to v and v is highly connected to u.

In the following sections we present a forward backward adaptation of stochastic similarity search algorithms (FBS) that can be “plugged-in” to many existing similarity search systems. The FBS-adaptation creates a dual-perspective similarity score that satisfies the forward and backward criteria introduced above. Next, we show how the forward backward similarity search can be used to improve community analysis and link prediction, and we propose a new task called Wikipedia Category Selection that ties a given Wikipedia page with its most similar top level category. Finally, we present a qualitative study that compares the top similarity results for 5 well known data mining researchers.

Section snippets

Forward backward similarity search

To address the problems presented in the previous section, we propose a bi-directional adaptation to stochastic search algorithms to create a forward backward similarity search (FBS) system.

Let $G=(V,E)$ denote a graph $G$ containing vertices $V$ and edges $E$. The similarity score of $v$ given some query vertex $u$ is defined as $s_G(u,v) = f(\pi_{G,u}(v), \pi_{G,v}(u))$, where $\pi_{G,u}(v)$ is the similarity score of $v$ on graph $G$ from the perspective of $u$, and $f$ represents an arbitrary combination function, such as linear
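
To make the definition concrete, the following Python sketch computes a forward score, a backward score, and their combination on a tiny directed graph. Several choices here are assumptions made for illustration and are not taken from the paper: personalized PageRank by power iteration stands in for the single-perspective score, the restart probability `alpha = 0.15` is arbitrary, dangling nodes return the walk to the source, and the combination function defaults to a simple product. The names `personalized_pagerank`, `fbs`, and the graph `toy` are hypothetical.

```python
from typing import Callable, Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]  # node -> list of out-neighbours


def personalized_pagerank(
    graph: Graph, source: Hashable, alpha: float = 0.15, iters: int = 100
) -> Dict[Hashable, float]:
    """Power-iteration PPR: the walk restarts at `source` with probability alpha."""
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs} | {source}
    rank = {n: 0.0 for n in nodes}
    rank[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            nbrs = graph.get(n, [])
            if nbrs:
                share = (1 - alpha) * rank[n] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
            else:
                # Assumption: a dangling node sends its walk back to the source.
                nxt[source] += (1 - alpha) * rank[n]
        nxt[source] += alpha  # restart mass (total rank mass stays 1)
        rank = nxt
    return rank


def fbs(
    graph: Graph,
    u: Hashable,
    v: Hashable,
    combine: Callable[[float, float], float] = lambda fwd, bwd: fwd * bwd,
) -> float:
    """Combine the score of v seen from u with the score of u seen from v."""
    forward = personalized_pagerank(graph, u)[v]
    backward = personalized_pagerank(graph, v)[u]
    return combine(forward, backward)


# Toy graph: q cites a and b; only a cites q back.
toy: Graph = {"q": ["a", "b"], "a": ["q"], "b": ["c"]}
forward = personalized_pagerank(toy, "q")
print(forward["a"], forward["b"])            # identical forward scores
print(fbs(toy, "q", "a"), fbs(toy, "q", "b"))  # backward view separates them
```

In this toy graph the forward scores of `a` and `b` are identical from the perspective of `q`, so a forward-only measure cannot separate them. Because `b` never leads back to `q`, its backward score is zero and the combined score ranks `a` above `b`.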

Experiments

The notion of relatedness or similarity plays a critical part in data mining and machine learning algorithms where models are induced by finding intra-cluster similarity and inter-cluster separability, in the case of clustering algorithms, or by drawing a hyperplane comparing class-instances in the case of classification algorithms.

Because of built-in biases and assumptions, similarity measures may succeed in one task, only to fail in many others. To show the robustness of FBS we performed four

Related work

Local neighborhood similarity measures (see [23] for a comprehensive study) count the number of common neighbors between two vertices weighted by the total number of edges for each vertex [18]. These local measures perform impressively on link prediction or concept similarity tasks [24]. Yet, because local similarity measures only look at the ego networks of the query and target nodes, they will not work if the query and target are separated by more than one hop, even if they are highly
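
As an illustration of the local measures discussed above, the sketch below implements two well-known common-neighbor scores, the Jaccard coefficient and Adamic Adar, on an undirected graph stored as adjacency sets. The graph `path` and the function names are hypothetical and chosen only for this example.

```python
import math
from typing import Dict, Hashable, Set

Adjacency = Dict[Hashable, Set[Hashable]]  # undirected graph as neighbour sets


def jaccard(adj: Adjacency, u: Hashable, v: Hashable) -> float:
    """Common neighbours divided by the size of the combined neighbourhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj: Adjacency, u: Hashable, v: Hashable) -> float:
    """Sum over common neighbours z of 1 / log(degree(z))."""
    return sum(
        1.0 / math.log(len(adj[z]))
        for z in adj[u] & adj[v]
        if len(adj[z]) > 1  # guard against log(1) = 0
    )


# Path graph a - b - c - d.
path: Adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

print(jaccard(path, "a", "c"), adamic_adar(path, "a", "c"))  # share neighbour b
print(jaccard(path, "a", "d"), adamic_adar(path, "a", "d"))  # no shared neighbour
```

Both scores drop to zero for `a` and `d`, which are connected by a path but share no neighbor. This is the locality limitation that motivates measures based on longer walks.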

Conclusions

In the present work we argue that network similarity should be considered from the perspective of the target and the source nodes. To that end, we have proposed a dual perspective similarity metric called Forward Backward Similarity (FBS) that calculates network similarity based on the perspective of both the query node and the candidate endpoint. Additionally, FBS can be “plugged-in” to many existing network similarity algorithms, thereby extending its use to many different situations.

Acknowledgements

This work is supported by the Templeton Foundation under grant FP053369-M/O.

References (54)

  • B. Bahmani et al. Fast incremental and personalized pagerank. VLDB (2010)
  • M. Ley. The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. SPIRE (2002)
  • J. Tang et al. Arnetminer: Extraction and Mining of Academic Social Networks. KDD (2008)
  • M. Zaharia et al. Spark: Cluster Computing with Working Sets. HotCloud, USENIX, Berkeley (2010)
  • P. Sarkar et al. Fast Incremental Proximity Search in Large Graphs. Proceedings of the 25th international conference on Machine learning (2008)
  • G. Pan et al. Online community detection for large complex networks. PLoS ONE (2014)
  • H. Shiokawa et al. Fast Algorithm for Modularity-based Graph Clustering. AAAI (2013)
  • D. Lizorkin et al. Analysis of Community Structure in Wikipedia. WWW, IW3C2, Geneva (2009)
  • M.E. Newman. Modularity and community structure in networks. Proc. Nat. Acad. Sci. (2006)
  • A. Clauset et al. Finding community structure in very large networks. Phys. Rev. E (2004)
  • M. Kusumoto et al. Scalable Similarity Search for Simrank. SIGMOD (2014)
  • D. Liben-Nowell et al. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. (2007)
  • Z. Lu et al. Supervised Link Prediction Using Multiple Sources. ICDM (2010)
  • N. Shibata et al. Link prediction in citation networks. JASIST (2012)
  • X. Yu et al. Citation Prediction in Heterogeneous Bibliographic Networks. SDM (2012)
  • P. Sarkar et al. Theoretical justification of popular link prediction heuristics. IJCAI (2011)
  • P. Symeonidis et al. Transitive Node Similarity for Link Prediction in Social Networks with Positive and Negative Links. RecSys (2010)