ABSTRACT
Processing large-scale graphs with billions to trillions of edges requires efficiently utilizing parallel systems. However, current graph processing engines do not scale well beyond a few tens of computing nodes because they are oblivious to the communication cost variations across the interconnection hierarchy. We introduce GraphCube, a better approach to optimizing graph processing on large-scale parallel systems with complex interconnections. GraphCube features a new graph partitioning approach to achieve better load balancing and minimize communication overhead across multiple levels of the interconnection hierarchy. We evaluate GraphCube by applying it to fundamental graph operations performed on synthetic and real-world graph datasets. Our evaluation used up to 79,024 computing nodes and 1.2+ million processor cores. Our large-scale experiments show that GraphCube outperforms state-of-the-art parallel graph processing methods in throughput and scalability. Furthermore, GraphCube outperformed the top-ranked systems on the Graph 500 list.
Supplemental Material
Available for Download
Supplemental material.
- Zainab Abbas, Vasiliki Kalavri, Paris Carbone, and Vladimir Vlassov. 2018. Streaming graph partitioning: an experimental study. Proceedings of the VLDB Endowment 11, 11 (2018), 1590--1603.Google ScholarDigital Library
- Soramichi Akiyama. 2020. Assessing Impact of Data Partitioning for Approximate Memory in C/C++ Code. arXiv preprint arXiv:2004.01637 (2020).Google Scholar
- Scott Beamer, Krste Asanovic, and David Patterson. 2012. Direction-optimizing breadth-first search. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1--10.Google ScholarDigital Library
- Huanqi Cao, Yuanwei Wang, Haojie Wang, Heng Lin, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling graph traversal to 281 trillion edges with 40 million cores. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 234--245.Google ScholarDigital Library
- Donglin Chen, Jianbin Fang, Shizhao Chen, Chuanfu Xu, and Zheng Wang. 2019. Optimizing sparse matrix-vector multiplications on an armv8-based many-core architecture. International Journal of Parallel Programming 47, 3 (2019), 418--432.Google ScholarDigital Library
- Donglin Chen, Jianbin Fang, Chuanfu Xu, Shizhao Chen, and Zheng Wang. 2020. Characterizing scalability of sparse matrix-vector multiplications on phytium ft-2000+. International Journal of Parallel Programming 48, 1 (2020), 80--97.Google ScholarDigital Library
- R. Chen, J. Shi, Y. Chen, and H. Chen. 2015. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. European Conference on Computer Systems (2015), 1--15.Google Scholar
- Rong Chen, Jiaxin Shi, Yanzhe Chen, Binyu Zang, Haibing Guan, and Haibo Chen. 2018. PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs. ACM Trans. Parallel Comput. 5, 3 (2018), 13:1--13:39.Google ScholarDigital Library
- Rong Chen, Jiaxin Shi, Yanzhe Chen, Binyu Zang, Haibing Guan, and Haibo Chen. 2019. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. ACM Transactions on Parallel Computing (TOPC) 5, 3 (2019), 1--39.Google ScholarDigital Library
- Yongzhi Chen and Yuefan Deng. 2009. A detailed analysis of communication load balance on BlueGene supercomputer. Computer physics communications 180, 8 (2009), 1251--1258.Google Scholar
- William James Dally and Brian Patrick Towles. 2004. Principles and practices of interconnection networks. Elsevier.Google ScholarDigital Library
- Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. 2018. Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics. In Proceedings of the 39th ACM SIGPLAN conference on programming language design and implementation. 752--768.Google ScholarDigital Library
- Timothy A Davis. 2006. Direct methods for sparse linear systems. SIAM.Google Scholar
- Jack Dongarra. 2020. Report on the Fujitsu Fugaku system. University of Tennessee-Knoxville Innovative Computing Laboratory, Tech. Rep. ICLUT-20-06 (2020).Google Scholar
- Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, et al. 2021. GraphScope: a unified engine for big graph processing. Proceedings of the VLDB Endowment 14, 12 (2021), 2879--2892.Google ScholarDigital Library
- Wenfei Fan, Muyang Liu, Chao Tian, Ruiqi Xu, and Jingren Zhou. 2020. Incrementalization of graph partitioning algorithms. Proceedings of the VLDB Endowment 13, 8 (2020), 1261--1274.Google ScholarDigital Library
- Wenfei Fan, Ruiqi Xu, Qiang Yin, Wenyuan Yu, and Jingren Zhou. 2022. Application-driven graph partitioning. The VLDB Journal (2022), 1--24.Google Scholar
- Jianbin Fang, Peng Zhang, Chun Huang, Tao Tang, Kai Lu, Ruibo Wang, and Zheng Wang. 2022. Programming Bare-Metal Accelerators with Heterogeneous Threading Models: A Case Study of Matrix-3000. arXiv preprint arXiv:2210.12230 (2022).Google Scholar
- Xing Feng, Lijun Chang, Xuemin Lin, Lu Qin, and Wenjie Zhang. 2016. Computing connected components with linear communication cost in pregel-like systems. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 85--96.Google ScholarCross Ref
- Zhisong Fu, Michael Personick, and Bryan Thompson. 2014. Map-Graph: A high level API for fast development of high performance graph analytics on GPUs. In Proceedings of workshop on GRAph data management experiences and systems. 1--6.Google ScholarDigital Library
- Pablo Fuentes, Mariano Benito, Enrique Vallejo, José Luis Bosque, Ramön Beivide, Andreea Anghel, Germán Rodríguez, Mitch Gusat, Cyriel Minkenberg, and Mateo Valero. 2017. A scalable synthetic traffic model of Graph500 for computer networks analysis. Concurrency and Computation: Practice and Experience 29, 24 (2017), e4231.Google ScholarCross Ref
- Xinbiao Gan, Yiming Zhang, Ruibo Wang, Tiejun Li, Tiaojie Xiao, Ruigeng Zeng, Jie Liu, and Kai Lu. 2021. TianheGraph: Customizing Graph Search for Graph500 on Tianhe Supercomputer. IEEE Transactions on Parallel and Distributed Systems (2021).Google Scholar
- Xinbiao Gan, Yiming Zhang, Ruigeng Zeng, Jie Liu, Ruibo Wang, Tiejun Li, Li Chen, and Kai Lu. 2022. XTree: Traversal-Based Partitioning for Extreme-Scale Graph Processing on Supercomputers. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 2046--2059.Google Scholar
- Tao Gao, Yutong Lu, Baida Zhang, and Guang Suo. 2014. Using the intel many integrated core to accelerate graph traversal. The International journal of high performance computing applications 28, 3 (2014), 255--266.Google Scholar
- Sayan Ghosh, Nathan R Tallent, and Mahantesh Halappanavar. 2021. Characterizing Performance of Graph Neighborhood Communication Patterns. IEEE Transactions on Parallel and Distributed Systems 33, 4 (2021), 915--928.Google Scholar
- Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. 17--30.Google ScholarDigital Library
- Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. 2014. GraphX: graph processing in a distributed dataflow framework. In Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation. 599--613.Google Scholar
- Samuel Grossman, Heiner Litz, and Christos Kozyrakis. 2018. Making pull-based graph processing performant. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 246--260.Google ScholarDigital Library
- http://graph500.org/. 2021. The Graph 500 List. https://graph500.org/ Last accessed 03 March 2022.Google Scholar
- Eu Inc. 2022. url.eu-2015. https://law.di.unimi.it/webdata/eu-2015/ Last accessed 03 December 2022.Google Scholar
- Twitter Inc. 2021. twitter-2010. https://law.di.unimi.it/webdata/twitter-2010/ Last accessed 03 December 2021.Google Scholar
- Keita Iwabuchi, Hitoshi Sato, Yuichiro Yasui, Katsuki Fujisawa, and Satoshi Matsuoka. 2014. NVM-based hybrid BFS with memory efficient data structure. In 2014 IEEE International Conference on Big Data (Big Data). IEEE, 529--538.Google ScholarCross Ref
- George Karypis and Vipin Kumar. 1995. METIS-unstructured graph partitioning and sparse matrix ordering system, version 2.0. (1995).Google Scholar
- George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing 20, 1 (1998), 359--392.Google ScholarDigital Library
- George Karypis, Kirk Schloegel, and Vipin Kumar. 1997. Parmetis: Parallel graph partitioning and sparse matrix ordering library. (1997).Google Scholar
- John Kim, Wiliam J Dally, Steve Scott, and Dennis Abts. 2008. Technology-driven, highly-scalable dragonfly topology. In 2008 International Symposium on Computer Architecture. IEEE, 77--88.Google ScholarDigital Library
- Deyu Kong, Xike Xie, and Zhuoxu Zhang. 2022. Clustering-based Partitioning for Large Web Graphs. arXiv preprint arXiv:2201.00472 (2022).Google Scholar
- Dongsheng Li, Yiming Zhang, Jinyan Wang, and KianLee Tan. 2019. TopoX: Topology Refactorization for Efficient Graph Partitioning and Processing. PVLDB 12, 8 (2019), 891--905.Google ScholarDigital Library
- Dongsheng Li, Yiming Zhang, Jinyan Wang, and Kian-Lee Tan. 2019. TopoX: Topology refactorization for efficient graph partitioning and processing. Proceedings of the VLDB Endowment 12, 8 (2019), 891--905.Google ScholarDigital Library
- Z Li, C Wu, and Y Li. 2021. FEP-based large-scale virtual screening for effective drug discovery against COVID-19. In Int. Conf. High Performance Computing, Networking, Storage, and Analysis.Google Scholar
- Xiangke LIAO, Liquan XIAO, Canqun YANG, and Yutong LU. [n.d.]. MilkyWay-2 supercomputer: system and application. ([n. d.]).Google Scholar
- Xiang-Ke Liao, Zheng-Bin Pang, Ke-Fei Wang, Yu-Tong Lu, Min Xie, Jun Xia, De-Zun Dong, and Guang Suo. 2015. High performance interconnect network for Tianhe system. Journal of Computer Science and Technology 30, 2 (2015), 259--272.Google ScholarCross Ref
- Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable graph traversal on sunway taihulight with ten million cores. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 635--645.Google ScholarCross Ref
- Heng Lin, Xiaowei Zhu, Bowen Yu, Xiongchao Tang, Wei Xue, Wenguang Chen, Lufei Zhang, Torsten Hoefler, Xiaosong Ma, Xin Liu, et al. 2018. Shentu: processing multi-trillion edge graphs on millions of cores in seconds. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 706--716.Google ScholarDigital Library
- Yucheng Low. 2013. Graphlab: A distributed abstraction for large scale machine learning. University of California (2013).Google Scholar
- Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB 5, 8 (2012), 716--727.Google ScholarDigital Library
- Kai Lu, Yaohua Wang, Yang Guo, Chun Huang, Sheng Liu, Ruibo Wang, Jianbin Fang, Tao Tang, Zhaoyun Chen, Biwei Liu, et al. 2022. MT-3000: a heterogeneous multi-zone processor for HPC. CCF Transactions on High Performance Computing (2022), 1--15.Google Scholar
- Meilian Lu, Zhenglin Zhang, Zhihe Qu, and Yu Kang. 2018. LPANNI: Overlapping community detection using label propagation in large-scale complex networks. IEEE Transactions on Knowledge and Data Engineering 31, 9 (2018), 1736--1749.Google ScholarDigital Library
- Yutong Lu. 2019. Paving the way for China exascale computing. CCF Transactions on High Performance Computing 1, 2 (2019), 63--72.Google ScholarCross Ref
- Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2009. Pregel: a system for large-scale graph processing. Sigmod (2009), 135--146.Google Scholar
- Ruben Mayer and Hans-Arno Jacobsen. 2021. Hybrid edge partitioner: partitioning large power-law graphs under memory constraints. In Proceedings of the 2021 International Conference on Management of Data. 1289--1302.Google ScholarDigital Library
- Andrey Molyakov. 2019. Age of Great Chinese Dragon: Supercomputer Centers and High Performance Computing. Journal of Electrical and Electronic Engineering 7, 4 (2019), 87--94.Google Scholar
- Masahiro Nakao, Koji Ueno, Katsuki Fujisawa, Yuetsu Kodama, and Mitsuhisa Sato. 2020. Performance Evaluation of Supercomputer Fugaku using Breadth-First Search Benchmark in Graph500. In 2020 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 408--409.Google ScholarCross Ref
- Masahiro Nakao, Koji Ueno, Katsuki Fujisawa, Yuetsu Kodama, and Mitsuhisa Sato. 2021. Performance of the supercomputer fugaku for breadth-first search in graph500 benchmark. In High Performance Computing: 36th International Conference, ISC High Performance 2021, Virtual Event, June 24--July 2, 2021, Proceedings 36. Springer, 372--390.Google ScholarDigital Library
- Mahdi Nikdast, Jiang Xu, Luan HK Duong, Xiaowen Wu, Zhehui Wang, Xuan Wang, and Zhe Wang. 2014. Fat-tree-based optical interconnection networks under crosstalk noise constraint. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 23, 1 (2014), 156--169.Google ScholarDigital Library
- Joel Nishimura and Johan Ugander. 2013. Restreaming graph partitioning: simple versatile algorithms for advanced balancing. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 1106--1114.Google ScholarDigital Library
- Anil Pacaci and M Tamer Özsu. 2019. Experimental analysis of streaming algorithms for graph partitioning. In Proceedings of the 2019 International Conference on Management of Data. 1375--1392.Google ScholarDigital Library
- Julian Shun and Guy E Blelloch. 2013. Ligra: a lightweight graph processing framework for shared memory. In ACM Sigplan Notices, Vol. 48. ACM, 135--146.Google ScholarDigital Library
- George M Slota, Cameron Root, Karen Devine, Kamesh Madduri, and Sivasankaran Rajamanickam. 2020. Scalable, multi-constraint, complex-objective graph partitioning. IEEE Transactions on Parallel and Distributed Systems 31, 12 (2020), 2789--2801.Google ScholarDigital Library
- Hari Subramoni, Albert Mathews Augustine, Mark Arnold, Jonathan Perkins, Xiaoyi Lu, Khaled Hamidouche, and Dhabaleswar K Panda. 2016. INAM 2: InfiniBand Network Analysis and Monitoring with MPI. In International Conference on High Performance Computing. Springer, 300--320.Google Scholar
- Toyotaro Suzumura, Koji Ueno, Hitoshi Sato, Katsuki Fujisawa, and Satoshi Matsuoka. 2011. Performance characteristics of Graph500 on large-scale distributed environment. In 2011 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 149--158.Google ScholarDigital Library
- Koji Ueno and Toyotaro Suzumura. 2012. 2d partitioning based graph search for the graph500 benchmark. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, 1925--1931.Google ScholarDigital Library
- Koji Ueno and Toyotaro Suzumura. 2012. Highly scalable graph search for the graph500 benchmark. In Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing. 149--160.Google ScholarDigital Library
- Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka. 2016. Extreme scale breadth-first search on supercomputers. In 2016 IEEE International Conference on Big Data (Big Data). IEEE, 1040--1047.Google ScholarCross Ref
- Carnegie Mellon University. 2021. ClueWeb12 Dataset. https://lemurproject.org/clueweb12/ Last accessed 03 December 2021.Google Scholar
- Erik Vermij, Leandro Fiorin, Christoph Hagleitner, and Koen Bertels. 2017. Boosting the efficiency of HPCG and Graph500 with near-data processing. In 2017 46th International Conference on Parallel Processing (ICPP). IEEE, 31--40.Google ScholarCross Ref
- Ruibo Wang, Kai Lu, Juan Chen, Wenzhe Zhang, Jinwen Li, Yuan Yuan, Pingjing Lu, Libo Huang, Shengguo Li, and Xiaokang Fan. 2020. Brief introduction of TianHe exascale prototype system. Tsinghua Science and Technology 26, 3 (2020), 361--369.Google ScholarCross Ref
- Yuanwei Wang, Huanqi Cao, Zixuan Ma, Wanwang Yin, and Wenguang Chen. 2022. Scaling graph 500 SSSP to 140 trillion edges with over 40 million cores. In 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 248--262.Google ScholarCross Ref
- Min Xie, Yutong Lu, Kefei Wang, Lu Liu, Hongjia Cao, et al. 2011. Tianhe-1a interconnect and message-passing services. IEEE Micro 32, 1 (2011), 8--20.Google Scholar
- Xue-Jun Yang, Xiang-Ke Liao, Kai Lu, Qing-Feng Hu, Jun-Qiang Song, and Jin-Shu Su. 2011. The TianHe-1A supercomputer: its hardware and software. Journal of computer science and technology 26, 3 (2011), 344--351.Google ScholarCross Ref
- Yuichiro Yasui and Katsuki Fujisawa. 2015. Fast and scalable NUMA-based thread parallel breadth-first search. In 2015 International Conference on High Performance Computing & Simulation (HPCS). IEEE, 377--385.Google ScholarCross Ref
- Yuichiro Yasui, Katsuki Fujisawa, and Kazushige Goto. 2013. NUMA-optimized parallel breadth-first search on multicore single-node system. In 2013 IEEE International Conference on Big Data. IEEE, 394--402.Google ScholarCross Ref
- Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. 2005. A scalable distributed parallel breadth-first search algorithm on BlueGene/L. In SC'05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. IEEE, 25--25.Google ScholarDigital Library
- Jason Jongjin Yoo and Alan E Willner. 2001. A Performance and Implementation Comparison of Bidirectional and Dual Bus 2-D WDM Multiple-Plane Optical Interconnections with Row-ColumnMultihop Network Structures. Journal of lightwave technology 19, 6 (2001), 801.Google ScholarCross Ref
- Xin You, Hailong Yang, Zhongzhi Luan, Yi Liu, and Depei Qian. 2019. Performance evaluation and analysis of linear algebra kernels in the prototype tianhe-3 cluster. In Asian Conference on Supercomputing Frontiers. Springer, 86--105.Google ScholarCross Ref
- Jeffrey Young, Julian Romera, Matthias Hauck, and Holger Fröning. 2016. Optimizing communication for a 2D-partitioned scalable BFS. In 2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1--7.Google ScholarCross Ref
- Chenglong Zhang. 2020. A New Perspective of Graph Data and A Generic and Efficient Method for Large Scale Graph Data Traversal. arXiv preprint arXiv:2009.07463 (2020).Google Scholar
- Hongyang Zhang, Peter Lofgren, and Ashish Goel. 2016. Approximate personalized pagerank on dynamic graphs. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1315--1324.Google ScholarDigital Library
- Yiming Zhang, Haonan Wang, Menghan Jia, Jinyan Wang, Dong sheng Li, Guangtao Xue, and K. Tan. 2020. TopoX: Topology Refactorization for Minimizing Network Communication in Graph Computations. IEEE/ACM Transactions on Networking 28 (2020), 2768--2782.Google ScholarDigital Library
- Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A Computation-Centric Distributed Graph Processing System. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2--4, 2016, Kimberly Keeton and Timothy Roscoe (Eds.). USENIX Association, 301--316. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/zhuGoogle Scholar
Index Terms
- GraphCube: Interconnection Hierarchy-aware Graph Processing
Recommendations
FT-topo: Architecture-Driven Folded-Triangle Partitioning for Communication-efficient Graph Processing
ICS '23: Proceedings of the 37th International Conference on SupercomputingAs graph size (numbers of vertices and edges) is increasing from billions to trillions, efficient graph processing requires exascale computing clusters, which consist of hundreds of thousands of nodes connected via hierarchical networks with multiple ...
Partitioning of a graph into induced subgraphs not containing prescribed cliques
AbstractLet K p be a complete graph of order p ≥ 2. A K p-free k-coloring of a graph H is a partition of V ( H ) into V 1 , V 2 … , V k such that H [ V i ] does not contain K p for each i ≤ k. In 1977 Borodin and Kostochka conjectured that any graph H ...
Partitioning a Graph into Complementary Subgraphs
WALCOM: Algorithms and ComputationAbstractIn the Partition Into Complementary Subgraphs (Comp-Sub) problem we are given a graph , and an edge set property , and asked whether G can be decomposed into two graphs, H and its complement , for some graph H, in such a way that the edge cut-set (...
Comments