DOI: 10.1145/3581784.3607049

Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism

Published: 11 November 2023

ABSTRACT

This paper introduces Itoyori, a task-parallel runtime system designed to tackle the challenge of scaling task parallelism (more specifically, nested fork-join parallelism) beyond a single node. The partitioned global address space (PGAS) model is often employed in task-parallel systems, but naively combining the two can lead to poor performance due to fine-grained and redundant remote memory accesses. Itoyori addresses this issue by automatically caching global memory accesses at runtime, enabling efficient cache sharing among parallel tasks running on the same processor. As a real-world case study, we ported an existing task-parallel implementation of the Fast Multipole Method (FMM) to distributed memory with Itoyori and achieved a 7.5× speedup when scaled from a single node to 12 nodes, and up to 6.0× faster performance than without caching. This study demonstrates that global-view fork-join programming can be made practical and scalable while requiring minimal changes to the shared-memory code.
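The abstract describes a global-view, nested fork-join program whose memory accesses are absorbed by a runtime-managed software cache over the PGAS. The following C++ sketch illustrates that pattern in shared memory only: it uses std::async as a stand-in for a distributed work-stealing scheduler, and the bracketed comments mark where a PGAS runtime such as Itoyori would cache ("check out") and release ("check in") blocks of global memory. All names, the cutoff value, and the cache-management comments are illustrative assumptions for this sketch, not Itoyori's actual API.

    // Minimal shared-memory sketch of global-view fork-join reduction.
    // std::async stands in for a distributed work-stealing scheduler;
    // comments mark where a PGAS runtime would manage its software cache.
    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Recursive fork-join sum over an array. In a PGAS setting, `data`
    // would be a global pointer; each leaf would check out its block into
    // the process-local cache before reading, so fine-grained remote gets
    // are served locally instead of going to the network.
    long long parallel_sum(const long long* data, std::size_t n) {
      constexpr std::size_t cutoff = 1 << 14;  // serialize small subproblems
      if (n <= cutoff) {
        // [checkout data[0..n) for read]
        return std::accumulate(data, data + n, 0LL);
        // [checkin]
      }
      std::size_t half = n / 2;
      // Fork: the child task may be stolen by another worker
      // (or, in the distributed setting, by another node).
      auto left = std::async(std::launch::async, parallel_sum, data, half);
      long long right = parallel_sum(data + half, n - half);
      // Join: a distributed runtime would write back cached modifications
      // here so the continuation observes a consistent view.
      return left.get() + right;
    }

    int main() {
      std::vector<long long> v(1 << 20, 1);
      std::cout << parallel_sum(v.data(), v.size()) << "\n";  // prints 1048576
      return 0;
    }

In the distributed version this sketch imagines, the forked child can be executed on another node, so the cache only needs to be made consistent at fork/join boundaries rather than on every access, which keeps remote traffic coarse-grained.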

