DOI: 10.1145/3581784.3607049

Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism

Published: 11 November 2023

ABSTRACT

This paper introduces Itoyori, a task-parallel runtime system designed to tackle the challenge of scaling task parallelism (more specifically, nested fork-join parallelism) beyond a single node. The partitioned global address space (PGAS) model is often employed in task-parallel systems, but naively combining the two can lead to poor performance due to fine-grained and redundant remote memory accesses. Itoyori addresses this issue by automatically caching global memory accesses at runtime, enabling efficient cache sharing among parallel tasks running on the same processor. As a real-world case study, we ported an existing task-parallel implementation of the Fast Multipole Method (FMM) to distributed memory with Itoyori and achieved a 7.5× speedup when scaled from a single node to 12 nodes, and up to 6.0× faster performance than without caching. This study demonstrates that global-view fork-join programming can be made practical and scalable while requiring minimal changes to the shared-memory code.
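The abstract describes a global-view, nested fork-join program whose memory accesses are absorbed by a runtime-managed software cache over the PGAS. The following C++ sketch illustrates that pattern in shared memory only: it uses std::async as a stand-in for a distributed work-stealing scheduler, and the bracketed comments mark where a PGAS runtime such as Itoyori would cache ("check out") and release ("check in") blocks of global memory. All names, the cutoff value, and the cache-management comments are illustrative assumptions for this sketch, not Itoyori's actual API.

    // Minimal shared-memory sketch of global-view fork-join reduction.
    // std::async stands in for a distributed work-stealing scheduler;
    // comments mark where a PGAS runtime would manage its software cache.
    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Recursive fork-join sum over an array. In a PGAS setting, `data`
    // would be a global pointer; each leaf would check out its block into
    // the process-local cache before reading, so fine-grained remote gets
    // are served locally instead of going to the network.
    long long parallel_sum(const long long* data, std::size_t n) {
      constexpr std::size_t cutoff = 1 << 14;  // serialize small subproblems
      if (n <= cutoff) {
        // [checkout data[0..n) for read]
        return std::accumulate(data, data + n, 0LL);
        // [checkin]
      }
      std::size_t half = n / 2;
      // Fork: the child task may be stolen by another worker
      // (or, in the distributed setting, by another node).
      auto left = std::async(std::launch::async, parallel_sum, data, half);
      long long right = parallel_sum(data + half, n - half);
      // Join: a distributed runtime would write back cached modifications
      // here so the continuation observes a consistent view.
      return left.get() + right;
    }

    int main() {
      std::vector<long long> v(1 << 20, 1);
      std::cout << parallel_sum(v.data(), v.size()) << "\n";  // prints 1048576
      return 0;
    }

In the distributed version this sketch imagines, the forked child can be executed on another node, so the cache only needs to be made consistent at fork/join boundaries rather than on every access, which keeps remote traffic coarse-grained.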

