NScale: neighborhood-centric large-scale graph analytics in the cloud

Quamar, Abdul; Deshpande, Amol; Lin, Jimmy

doi:10.1007/s00778-015-0405-2

Regular Paper
Published: 13 October 2015

The VLDB Journal, volume 25, pages 125–150 (2016)

Abstract

There is an increasing interest in executing complex analyses over large graphs, many of which require processing a large number of multi-hop neighborhoods or subgraphs. Examples include ego network analysis, motif counting, finding social circles, personalized recommendations, link prediction, anomaly detection, analyzing influence cascades, and others. These tasks are not well served by existing vertex-centric graph processing frameworks, where user programs are only able to directly access the state of a single vertex at a time, resulting in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks ignore the challenges in extracting the relevant portions of the graph that an analysis task is interested in, and loading those onto distributed memory. This paper introduces NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex subgraph-centric analytics over large-scale graphs in the cloud. NScale enables users to write programs at the level of subgraphs rather than at the level of vertices. Unlike most previous graph processing frameworks, which apply the user program to the entire graph, NScale allows users to declaratively specify subgraphs of interest. Our framework includes a novel graph extraction and packing (GEP) module that utilizes a cost-based optimizer to partition and pack the subgraphs of interest into memory on as few machines as possible. The distributed execution engine then takes over and runs the user program in parallel on those subgraphs, restricting the scope of the execution appropriately, and utilizes novel techniques to minimize memory consumption by exploiting overlaps among the subgraphs. We present a comprehensive empirical evaluation comparing against three state-of-the-art systems, namely Giraph, GraphLab, and GraphX, on several real-world datasets and a variety of analysis tasks. 
Our experimental results show orders-of-magnitude improvements in performance and drastic reductions in the cost of analytics compared to vertex-centric approaches.
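The packing idea described in the abstract (placing overlapping subgraphs together so that shared vertices are stored once, on as few machines as possible) can be illustrated with a small greedy heuristic. This is only a sketch under stated assumptions, not the cost-based optimizer of the GEP module: the function name `pack_subgraphs`, the representation of a subgraph as a set of vertex ids, and the bin `capacity` measured in distinct vertices are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def pack_subgraphs(subgraphs: Dict[str, Set[int]], capacity: int) -> List[dict]:
    """Greedy overlap-aware packing (illustrative; not NScale's optimizer).

    Each bin stores the union of the vertex sets of its member subgraphs,
    so vertices shared by several members count once against `capacity`.
    """
    bins: List[dict] = []
    # Place larger subgraphs first; break ties by name for determinism.
    order = sorted(subgraphs, key=lambda s: (-len(subgraphs[s]), s))
    for name in order:
        nodes = subgraphs[name]
        if len(nodes) > capacity:
            raise ValueError(f"subgraph {name!r} exceeds bin capacity")
        best, best_added = None, None
        for b in bins:
            # Number of vertices this subgraph would add to the bin.
            added = len(nodes - b["nodes"])
            fits = len(b["nodes"]) + added <= capacity
            if fits and (best_added is None or added < best_added):
                best, best_added = b, added
        if best is None:
            # No existing bin can hold the subgraph: open a new one.
            best = {"nodes": set(), "members": []}
            bins.append(best)
        best["nodes"] |= nodes
        best["members"].append(name)
    return bins


# Hypothetical ego networks given as sets of vertex ids.
ego = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 4, 5},
    "c": {10, 11, 12},
    "d": {11, 12, 13},
}
bins = pack_subgraphs(ego, capacity=5)
for b in bins:
    print(b["members"], sorted(b["nodes"]))
```

In this toy input the four subgraphs contain 14 vertex entries in total, but because overlapping subgraphs share a bin, the two bins together store only 9 distinct vertices.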

Notes

For input graphs represented as an edge list with the vertex attributes available as a separate mapping, we have a minor modification to the first stage that uses a MapReduce job to join the edge and vertex data and produce a distributed adjacency list in the required format.
We use the terms partitions and bins interchangeably in this paper.
The higher the value of \(k\), the better the quality of the result. We chose \(k = 6\) for our implementation, a value determined experimentally to balance the quality of shingle-based similarity against computation time.
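The trade-off in the last note can be made concrete with a min-hash sketch. The code below assumes that \(k\) is the number of min-hash values (shingles) computed per subgraph, which this page does not state explicitly; the helper names `_hash`, `shingles` and `similarity`, and the use of salted BLAKE2b digests as hash functions, are likewise illustrative assumptions rather than the paper's implementation.

```python
import hashlib
from typing import Iterable, List


def _hash(seed: int, vertex: int) -> int:
    """Deterministic 64-bit hash of a vertex id, salted by `seed`."""
    data = f"{seed}:{vertex}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingles(vertices: Iterable[int], k: int = 6) -> List[int]:
    """Return k min-hash values for a non-empty set of vertex ids."""
    vs = list(vertices)
    if not vs:
        raise ValueError("vertex set must be non-empty")
    # One minimum per salted hash function; a larger k gives a finer estimate.
    return [min(_hash(seed, v) for v in vs) for seed in range(k)]


def similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical vertex sets of two overlapping subgraphs.
sig_a = shingles({1, 2, 3, 4, 5})
sig_b = shingles({2, 3, 4, 5, 6})
print(similarity(sig_a, sig_b))
```

With only \(k\) positions to compare, the estimate can take just \(k + 1\) distinct values, which is one way to see why a larger \(k\) improves quality at the price of more hash computations per subgraph.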

Author information

Authors and Affiliations

University of Maryland, College Park, MD, United States
Abdul Quamar, Amol Deshpande & Jimmy Lin

Corresponding author

Correspondence to Abdul Quamar.

About this article

Cite this article

Quamar, A., Deshpande, A. & Lin, J. NScale: neighborhood-centric large-scale graph analytics in the cloud. The VLDB Journal 25, 125–150 (2016). https://doi.org/10.1007/s00778-015-0405-2

Received: 07 December 2014
Revised: 02 June 2015
Accepted: 21 September 2015
Published: 13 October 2015
Issue Date: April 2016
DOI: https://doi.org/10.1007/s00778-015-0405-2
