DOI: 10.1145/3559009.3569690

Optimizing Aggregate Computation of Graph Neural Networks with on-GPU Interpreter-Style Programming

Published: 27 January 2023

ABSTRACT

Graph Neural Networks (GNNs) generalize deep learning to graph-structured data and have achieved great success in many tasks. However, their irregular aggregation kernels make them inefficient on GPUs. The unpredictable control flow and memory references of irregular kernels preclude most optimizations designed for regular ones. For example, even when nodes have overlapping neighbors, reusing those neighbors via shared memory is non-trivial, because the neighborhoods are known only at runtime. This paper presents regGNN, an aggregation implementation that can benefit from the optimizations designed for regular kernels. It introduces the concept of "semi-regular" computation to describe aggregation: the irregularity comes only from the neighborhood traversal; aggregating the high-dimensional feature vectors, which dominates the computation, is data-independent and thus incurs no irregularity. regGNN encodes the aggregate computation steps of each thread block into an aggregate script, which replaces the graph as the input of the GPU kernel. The GPU kernel acts like an interpreter, and the aggregate script can be regarded as a program written in a simple GPU scripting language. Optimizations designed for regular kernels can then be applied to the aggregate script, since it is static and regular. regGNN demonstrates three such optimizations: (1) intelligently scheduling nodes and customizing shared-memory replacement to maximize data reuse, (2) reassigning nodes among warps for load balancing, and (3) aligning the aggregate script to improve memory latency hiding. Compared with state-of-the-art GNN frameworks, regGNN achieves 2.81× throughput on average for moderate-scale GNNs. The speedup increases to 5.21× for GNNs with small hidden sizes, and to hundreds of times for deep GNNs.
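
The interpreter analogy can be made concrete with a short sketch. The CUDA kernel below is an illustration of the idea, not regGNN's actual implementation: the `AggInst` encoding, the opcode set (`OP_CACHE`, `OP_ACC_G`, `OP_ACC_S`, `OP_FLUSH`), the per-block `script_off` ranges, and the one-thread-per-feature-lane layout are all assumptions introduced here, since the abstract does not specify the real script format. What it does show is the key property the abstract claims: every branch and every address depends only on the script, which is built offline, so the kernel carries no data-dependent irregularity, and shared-memory reuse (an `OP_CACHE` followed by `OP_ACC_S` hits) becomes an offline scheduling decision rather than a runtime gamble.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical instruction set for an aggregate script (illustrative only).
enum AggOp : uint32_t {
    OP_CACHE = 0,  // copy feature row `src` into shared-memory slot `dst`
    OP_ACC_G = 1,  // accumulate feature row `src` straight from global memory
    OP_ACC_S = 2,  // accumulate shared-memory slot `src` (a reused neighbor)
    OP_FLUSH = 3,  // write the running sum to output row `dst`, then reset it
};

struct AggInst { uint32_t op, src, dst; };

// Each thread block interprets its own slice of the script. All threads in a
// block step through the same instruction stream, so there is no divergence,
// and the high-dimensional vector work (one feature lane per thread) stays as
// regular as a dense kernel.
__global__ void interpret_aggregate(const AggInst* __restrict__ script,
                                    const int* __restrict__ script_off, // per-block [begin, end) script ranges
                                    const float* __restrict__ feat,     // [num_nodes, dim] input features
                                    float* __restrict__ out,            // [num_nodes, dim] aggregated output
                                    int dim)                            // assumes dim <= blockDim.x for brevity
{
    extern __shared__ float cache[];   // `slots * dim` floats, sized at launch
    const int lane = threadIdx.x;
    float acc = 0.0f;                  // per-lane running sum for the current output row

    const int begin = script_off[blockIdx.x];
    const int end   = script_off[blockIdx.x + 1];
    for (int pc = begin; pc < end; ++pc) {
        const AggInst inst = script[pc];
        if (lane < dim) {
            switch (inst.op) {
            case OP_CACHE: cache[inst.dst * dim + lane] = feat[inst.src * dim + lane]; break;
            case OP_ACC_G: acc += feat[inst.src * dim + lane];                         break;
            case OP_ACC_S: acc += cache[inst.src * dim + lane];                        break;
            case OP_FLUSH: out[inst.dst * dim + lane] = acc; acc = 0.0f;               break;
            }
        }
        // Conservative: order every cache write before later cache reads. A
        // tuned interpreter would sync only around OP_CACHE instructions.
        __syncthreads();
    }
}
```

Under these assumptions the kernel would be launched as `interpret_aggregate<<<num_blocks, block_dim, slots * dim * sizeof(float)>>>(...)`, and all three optimizations the abstract lists live on the host side where the script is generated: nodes with overlapping neighborhoods are scheduled into the same block so one `OP_CACHE` is amortized over many `OP_ACC_S` hits, per-block script lengths are equalized for load balance, and the instruction stream can be padded or reordered for latency hiding, all without changing the GPU code.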


Published in

PACT '22: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
October 2022, 569 pages
ISBN: 9781450398688
DOI: 10.1145/3559009

Copyright © 2022 ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions, 26%
