GraFlex: Flexible Graph Processing on FPGAs through Customized Scalable Interconnection Network

Authors:
Chunyou Su, The Hong Kong University of Science and Technology, Kowloon, Hong Kong (ORCID: 0000-0002-8655-8536)
Linfeng Du, The Hong Kong University of Science and Technology, Kowloon, Hong Kong (ORCID: 0000-0002-3007-4890)
Tingyuan Liang, The Hong Kong University of Science and Technology, Kowloon, Hong Kong (ORCID: 0000-0002-0390-2320)
Zhe Lin, Sun Yat-sen University, ShenZhen, China (ORCID: 0009-0002-1594-2335)
Maolin Wang, ACCESS, New Territories, Hong Kong (ORCID: 0000-0001-7449-9834)
Sharad Sinha, Indian Institute of Technology Goa, Goa, India (ORCID: 0000-0002-4532-2017)
Wei Zhang, The Hong Kong University of Science and Technology, Kowloon, Hong Kong (ORCID: 0000-0002-7622-6714)

FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, April 2024, Pages 143–153. https://doi.org/10.1145/3626202.3637573

Published: 02 April 2024

ABSTRACT

Designing graph processing systems is widely considered challenging because of the mismatch between the required computational throughput and the available memory bandwidth. Recent works try to deliver better graph processing systems by taking advantage of application-specific architectures and emerging high-bandwidth memory on FPGAs. However, there is still ample room for improvement in flexibility, scalability, and usability. This paper presents GraFlex, a flexible scatter-gather graph processing framework on FPGAs with scalable interconnection networks. It adopts the Bulk-Synchronous Parallel (BSP) paradigm for global control and synchronization, enabling rapid deployment of performant graph processing systems through HLS-based design flows. GraFlex conducts software-hardware co-optimization to boost system performance. It configures the compact graph format, partition scheme, and memory channel allocation strategy to support scalable designs. A resource-efficient multi-stage butterfly interconnection network provides on-device data communication and facilitates throughput matching. To handle fragmented memory requests, we propose coalesced memory access engines that improve bandwidth utilization. GraFlex is comprehensively evaluated with various graph applications and real-world datasets. Our results show up to 2.09× average speedup in traversal throughput over the existing state-of-the-art work, with a non-negligible reduction in power and resource consumption. A case study of the breadth-first search (BFS) application shows a 6.58× speedup in average algorithm throughput with proper implementation choices enabled by the scatter-gather mechanism. The BFS study also reports almost linear throughput scaling with the number of processing elements (PEs) and memory channels.
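
To make the scatter-gather and BSP terminology in the abstract concrete, the following is a minimal software sketch of a level-synchronous BFS written in that style. It is not GraFlex's implementation: the function name `bfs_scatter_gather`, the modulo-based assignment of vertices to PEs, and the per-PE `inbox` lists (a stand-in for the on-device interconnection network) are all assumptions made for illustration.

```python
from collections import defaultdict


def bfs_scatter_gather(edges, num_vertices, root, num_pes=4):
    """Level-synchronous BFS in scatter-gather form (illustrative sketch).

    Assumption: vertex v is owned by PE (v % num_pes). This partition is
    hypothetical and chosen only to keep the example short.
    """
    # Each PE stores the outgoing edges of the source vertices it owns.
    out_edges = [defaultdict(list) for _ in range(num_pes)]
    for src, dst in edges:
        out_edges[src % num_pes][src].append(dst)

    level = [-1] * num_vertices  # -1 marks an unvisited vertex
    level[root] = 0
    frontier = {root}
    depth = 0

    while frontier:
        # Scatter: every PE emits (dst, depth + 1) updates for its active
        # sources. inbox[q] models the updates routed to the PE that owns dst.
        inbox = [[] for _ in range(num_pes)]
        for pe in range(num_pes):
            for src in frontier:
                if src % num_pes != pe:
                    continue
                for dst in out_edges[pe].get(src, ()):
                    inbox[dst % num_pes].append((dst, depth + 1))

        # Gather: every PE applies the updates addressed to its own vertices.
        next_frontier = set()
        for pe in range(num_pes):
            for dst, new_level in inbox[pe]:
                if level[dst] == -1:
                    level[dst] = new_level
                    next_frontier.add(dst)

        # BSP barrier: the next superstep starts only after all PEs finish.
        frontier = next_frontier
        depth += 1

    return level
```

For the edge list `[(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]` with six vertices and root 0, the function returns `[0, 1, 1, 2, 3, -1]`, where `-1` marks the unreachable vertex 5. The result does not depend on `num_pes`, since the partition changes only which PE handles each update.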

Published in
FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
April 2024
300 pages
ISBN: 9798400704185
DOI: 10.1145/3626202
General Chair: Zhiru Zhang, Cornell University, USA
Program Chair: Andrew Putnam, Microsoft, USA
Copyright © 2024 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
fpga
graph processing
interconnection network
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 125 of 627 submissions, 20%