ABSTRACT
This paper introduces Itoyori, a task-parallel runtime system designed to tackle the challenge of scaling task parallelism (more specifically, nested fork-join parallelism) beyond a single node. The partitioned global address space (PGAS) model is often employed in task-parallel systems, but naively combining them can lead to poor performance due to fine-grained and redundant remote memory accesses. Itoyori addresses this issue by automatically caching global memory accesses at runtime, enabling efficient cache sharing among parallel tasks running on the same processor. As a real-world case study, we ported an existing task-parallel implementation of the Fast Multipole Method (FMM) to distributed memory with Itoyori and achieved a 7.5× speedup when scaled from a single node to 12 nodes and up to 6.0× faster performance than without caching. This study demonstrates that global-view fork-join programming can be made practical and scalable, while requiring minimal changes to the shared-memory code.
- Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. 2000. The Data Locality of Work Stealing. In Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures (Bar Harbor, Maine, USA) (SPAA '00). 1--12.Google ScholarDigital Library
- Sarita V. Adve and Mark D. Hill. 1990. Weak Ordering - A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture (Seattle, Washington, USA) (ISCA '90). 2--14.Google Scholar
- Shigeki Akiyama and Kenjiro Taura. 2015. Uni-Address Threads: Scalable Thread Management for RDMA-Based Work Stealing. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (Portland, Oregon, USA) (HPDC '15). 15--26.Google ScholarDigital Library
- Shigeki Akiyama and Kenjiro Taura. 2016. Scalable Work Stealing of Native Threads on an x86-64 Infiniband Cluster. Journal of Information Processing 24, 3 (May 2016), 583--596.Google ScholarCross Ref
- Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2009. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In Proceedings of the 15th International European Conference on Parallel and Distributed Computing (Delft, The Netherlands) (Euro-Par '09). 863--874.Google ScholarDigital Library
- Eduard Ayguadé, Nawal Copty, Alejandro Duran, Jay Hoeflinger, Yuan Lin, Federico Massaioli, Xavier Teruel, Priya Unnikrishnan, and Guansong Zhang. 2008. The Design of OpenMP Tasks. IEEE Transactions on Parallel and Distributed Systems 20, 3 (June 2008), 404--418.Google Scholar
- John Bachan, Scott Baden, Dan Bonachea, Johnny Corbino, Johnathan Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian Van Straalen, and Daniel Waters. 2022. UPC++ v1.0 Programmer's Guide, Revision 2022.9.0. Technical Report LBNL-2001479. Lawrence Berkeley National Laboratory, USA.Google Scholar
- Ayon Basumallik and Rudolf Eigenmann. 2006. Optimizing Irregular Shared-Memory Applications for Distributed-Memory Systems. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, New York, USA) (PPoPP '06). 119--128.Google ScholarDigital Library
- Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing Locality and Independence with Logical Regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Salt Lake City, Utah, USA) (SC '12). 66:1--66:11.Google ScholarDigital Library
- J. K. Bennett, J. B. Carter, and W. Zwaenepoel. 1990. Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence. In Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Seattle, Washington, USA) (PPOPP '90). 168--176.Google Scholar
- B.N. Bershad, M.J. Zekauskas, and W.A. Sawdon. 1993. The Midway Distributed Shared Memory System. In Digest of Papers. The 38th IEEE Computer Society International Conference (San Francisco, California, USA) (COMPCON Spring '93). 528--537.Google Scholar
- Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri. 2011. Scheduling Irregular Parallel Computations on Hierarchical Caches. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (San Jose, California, USA) (SPAA '11). 355--366.Google ScholarDigital Library
- Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. 1996. An Analysis of Dag-Consistent Distributed Shared-Memory Algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (Padua, Italy) (SPAA '96). 297--308.Google Scholar
- Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. 1996. Dag-Consistent Distributed Shared Memory. In Proceedings of the 10th International Parallel Processing Symposium (Honolulu, Hawaii, USA) (IPPS '96). 132--141.Google Scholar
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. 1995. Cilk: An Efficient Multithreaded Runtime System. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Santa Barbara, California, USA) (PPoPP '95). 207--216.Google ScholarDigital Library
- Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling Multithreaded Computations by Work Stealing. J. ACM 46, 5 (Sept. 1999), 720--748.Google ScholarDigital Library
- George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, and Jack J. Dongarra. 2013. PaRSEC: Exploiting Heterogeneity to Enhance Scalability. Computing in Science & Engineering 15, 6 (2013), 36--45.Google ScholarDigital Library
- Javier Bueno, Luis Martinell, Alejandro Duran, Montse Farreras, Xavier Martorell, Rosa M Badia, Eduard Ayguade, and Jesús Labarta. 2011. Productive Cluster Programming with OmpSs. In Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Bordeaux, France) (Euro-Par '11). 555--566.Google ScholarCross Ref
- Qingchao Cai, Wentian Guo, Hao Zhang, Divyakant Agrawal, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Yong Meng Teo, and Sheng Wang. 2018. Efficient Distributed Memory Management with RDMA and Caching. Proceedings of the VLDB Endowment 11, 11 (July 2018), 1604--1617.Google ScholarDigital Library
- Hannah Cartier, James Dinan, and D. Brian Larkins. 2021. Optimizing Work Stealing Communication with Structured Atomic Operations. In Proceedings of the 50th International Conference on Parallel Processing (Lemont, Illinois, USA) (ICPP '21). 36:1--36:10.Google Scholar
- Bradford L. Chamberlain, David Callahan, and Hans P. Zima. 2007. Parallel Programmability and the Chapel Language. International Journal of High Performance Computing Applications 21, 3 (2007), 291--312.Google ScholarDigital Library
- Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck Koelbel, and Lauren Smith. 2010. Introducing OpenSHMEM: SHMEM for the PGAS Community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model (New York, New York, USA) (PGAS '10). 1--3.Google ScholarDigital Library
- Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. 2005. X10: An Object-Oriented Approach to Non-Uniform Cluster Computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (San Diego, California, USA) (OOPSLA '05). 519--538.Google ScholarDigital Library
- Ho-Ren Chuang, Robert Lyerly, Stefan Lankes, and Binoy Ravindran. 2020. Scaling Shared Memory Multiprocessing Applications in Non-Cache-Coherent Domains. In Proceedings of the 13th ACM International Systems and Storage Conference (Haifa, Israel) (SYSTOR '20). 13--24.Google ScholarDigital Library
- Salvatore Di Girolamo, Flavio Vella, and Torsten Hoefler. 2017. Transparent Caching for RMA Systems. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (Orlando, Florida, USA) (IPDPS '17). 1018--1027.Google ScholarCross Ref
- James Dinan, Sriram Krishnamoorthy, D. Brian Larkins, Jarek Nieplocha, and P. Sadayappan. 2008. Scioto: A Framework for Global-View Task Parallelism. In Proceedings of the 37th International Conference on Parallel Processing (Portland, Oregon, USA) (ICPP '08). 586--593.Google Scholar
- James Dinan, D. Brian Larkins, Ponnuswamy Sadayappan, Sriram Krishnamoorthy, and Jarek Nieplocha. 2009. Scalable Work Stealing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (Portland, Oregon, USA) (SC '09). 53:1--53:11.Google ScholarDigital Library
- Tarek El-Ghazawi, William Carlson, Thomas Sterling, and Katherine Yelick. 2005. UPC: Distributed Shared Memory Programming. John Wiley & Sons.Google ScholarDigital Library
- Wataru Endo, Shigeyuki Sato, and Kenjiro Taura. 2020. MENPS: A Decentralized Distributed Shared Memory Exploiting RDMA. In Proceedings of 2020 IEEE/ACM Fourth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (Virtual Event) (IPDRM '20). 9--16.Google ScholarCross Ref
- Michael P. Ferguson and Daniel Buettner. 2015. Caching Puts and Gets in a PGAS Language Runtime. In Proceedings of the 2015 9th International Conference on Partitioned Global Address Space Programming Models (Washington, District of Columbia, USA) (PGAS '15). 13--24.Google Scholar
- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Montreal, Quebec, Canada) (PLDI '98). 212--223.Google Scholar
- Karl Fuerlinger, Tobias Fuchs, and Roger Kowalewski. 2016. DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorithms. In Proceedings of the 2016 IEEE 18th International Conference on High Performance Computing and Communications (Sydney, NSW, Australia) (HPCC '16). 983--990.Google ScholarCross Ref
- Thierry Gautier, Xavier Besseron, and Laurent Pigeon. 2007. KAAPI: A Thread Scheduling Runtime System for Data Flow Computations on Cluster of Multi-Processors. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation (London, Ontario, Canada) (PASCO '07). 15--23.Google ScholarDigital Library
- Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. 1990. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture (Seattle, Washington, USA) (ISCA '90). 15--26.Google ScholarDigital Library
- Sayan Ghosh, Yanfei Guo, Pavan Balaji, and Assefaw H. Gebremedhin. 2021. RMACXX: An Efficient High-Level C++ Interface over MPI-3 RMA. In Proceedings of the 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (Melbourne, Australia) (CCGrid '21). 143--155.Google Scholar
- Max Grossman, Vivek Kumar, Zoran Budimlić, and Vivek Sarkar. 2016. Integrating Asynchronous Task Parallelism with OpenSHMEM. In Proceedings of the Third Workshop on OpenSHMEM and Related Technologies (Baltimore, Maryland, USA) (OpenSHMEM '16). 3--17.Google ScholarCross Ref
- Tasuku Hiraishi, Masahiro Yasugi, Seiji Umatani, and Taiichi Yuasa. 2009. Backtracking-Based Load Balancing. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Raleigh, North Carolina, USA) (PPoPP '09). 55--64.Google ScholarDigital Library
- Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. 1992. Compiling Fortran D for MIMD Distributed-Memory Machines. Commun. ACM 35, 8 (1992), 66--80.Google ScholarDigital Library
- Torsten Hoefler, James Dinan, Rajeev Thakur, Brian Barrett, Pavan Balaji, William Gropp, and Keith Underwood. 2015. Remote Memory Access Programming in MPI-3. ACM Transactions on Parallel Computing 2, 2 (July 2015), 1--26.Google ScholarDigital Library
- Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey. 2014. HPX: A Task Based Programming Model in a Global Address Space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (Eugene, Oregon, USA) (PGAS '14). 6:1--6:11.Google ScholarDigital Library
- Stefanos Kaxiras, David Klaftenegger, Magnus Norgren, Alberto Ros, and Konstantinos Sagonas. 2015. Turning Centralized Coherence and Distributed Critical-Section Execution on Their Head: A New Approach for Scalable Distributed Shared Memory. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (Portland, Oregon, USA) (HPDC '15). 3--14.Google ScholarDigital Library
- Pete Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Zwaenepoel. 1994. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference (San Francisco, California, USA) (WTEC '94).Google ScholarDigital Library
- Pete Keleher, Alan L. Cox, and Willy Zwaenepoel. 1992. Lazy Release Consistency for Software Distributed Shared Memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture (Queensland, Australia) (ISCA '92). 13--21.Google ScholarDigital Library
- Charles H. Koelbel, David Loveman, Robert S. Schreiber, Guy L. Steele Jr., and Mary Zosel. 1993. High Performance Fortran Handbook. The MIT Press.Google Scholar
- Vivek Kumar, Yili Zheng, Vincent Cavé, Zoran Budimlić, and Vivek Sarkar. 2014. HabaneroUPC++: A Compiler-Free PGAS Library. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (Eugene, Oregon, USA) (PGAS '14). 5:1--5:10.Google ScholarDigital Library
- Okwan Kwon, Fahed Jubair, Rudolf Eigenmann, and Samuel Midkiff. 2012. A Hybrid Approach of OpenMP for Clusters. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New Orleans, Louisiana, USA) (PPoPP '12). 75--84.Google ScholarDigital Library
- Jinpil Lee and Mitsuhisa Sato. 2010. Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems. In Proceedings of the 39th International Conference on Parallel Processing Workshops (San Diego, California, USA) (ICPPW '10). 413--420.Google ScholarDigital Library
- Kai Li and Paul Hudak. 1989. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems 7, 4 (Nov. 1989), 321--359.Google ScholarDigital Library
- Jimmy Aguilar Mena, Omar Shaaban, Vicenç Beltran, Paul Carpenter, Eduard Ayguade, and Jesus Labarta. 2022. OmpSs-2@Cluster: Distributed Memory Execution of Nested OpenMP-style Tasks. In Proceedings of the 28th International European Conference on Parallel and Distributed Computing (Glasgow, Scotland, UK) (Euro-Par '22). 319--334.Google Scholar
- Seung-Jai Min, Costin Iancu, and Katherine Yelick. 2011. Hierarchical Work Stealing on Manycore Clusters. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (Galveston Island, Texas, USA) (PGAS '11). 1--10.Google Scholar
- Eric Mohr, David A. Kranz, and Robert H. Halstead. 1990. Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. In Proceedings of the 1990 ACM Conference on LISP and Functional Programming (Nice, France) (LFP '90). 185--197.Google Scholar
- Alessandro Morari, Antonino Tumeo, Daniel Chavarría-Miranda, Oreste Villa, and Mateo Valero. 2014. Scaling Irregular Applications through Data Aggregation and Software Multithreading. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (Phoenix, Arizona, USA) (IPDPS '14). 1126--1135.Google ScholarDigital Library
- Jun Nakashima and Kenjiro Taura. 2014. MassiveThreads: A Thread Library for High Productivity Languages. Concurrent Objects and Beyond 8665 (Jan. 2014), 222--238.Google Scholar
- Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2014. Grappa: A Latency-Tolerant Runtime for Large-Scale Irregular Applications. In Proceedings of the First International Workshop on Rack-Scale Computing (Amsterdam, The Netherlands) (WRSC '14). 1--7.Google Scholar
- Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2015. Latency-Tolerant Software Distributed Shared Memory. In Proceedings of the 2015 USENIX Annual Technical Conference (Denver, Colorado, USA) (USENIX ATC '15). 291--305.Google Scholar
- Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. 1996. Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers. The Journal of Supercomputing 10, 2 (1996), 169--189.Google ScholarDigital Library
- Robert W. Numrich and John Reid. 1998. Co-Array Fortran for Parallel Programming. SIGPLAN Fortran Forum 17, 2 (Aug. 1998), 1--31.Google ScholarDigital Library
- Stephen Olivier, Jun Huan, Jinze Liu, Jan Prins, James Dinan, P. Sadayappan, and Chau-Wen Tseng. 2006. UTS: An Unbalanced Tree Search Benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing (New Orleans, Los Angeles, USA) (LCPC '06). 235--250.Google Scholar
- Jeeva Paudel, Olivier Tardieu, and José Nelson Amaral. 2013. On the Merits of Distributed Work-Stealing on Selective Locality-Aware Tasks. In Proceedings of the 42nd International Conference on Parallel Processing (Lyon, France) (ICPP '13). 100--109.Google ScholarDigital Library
- Keith H. Randall. 1998. Cilk: Efficient Multithreaded Computing. Ph. D. Dissertation. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.Google Scholar
- James Reinders. 2007. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly Media.Google ScholarDigital Library
- Tao B. Schardl and I-Ting Angelina Lee. 2023. OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada) (PPoPP '23). 189--203.Google Scholar
- Joseph Schuchart and José Gracia. 2019. Global Task Data-Dependencies in PGAS Applications. In High Performance Computing: the 34th International Conference, ISC High Performance 2019 (Frankfurt/Main, Germany) (ISC '19). 312--329.Google Scholar
- Shumpei Shiina and Kenjiro Taura. 2019. Almost Deterministic Work Stealing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado, USA) (SC '19). 47:1--47:16.Google ScholarDigital Library
- Shumpei Shiina and Kenjiro Taura. 2022. Distributed Continuation Stealing is More Scalable than You Might Think. In Proceedings of the 2022 IEEE International Conference on Cluster Computing (Heidelberg, Germany) (Cluster '22). 129--141.Google ScholarCross Ref
- Shumpei Shiina and Kenjiro Taura. 2022. Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work Stealing. IEEE Transactions on Parallel and Distributed Systems 33, 12 (Dec. 2022), 4530--4546.Google ScholarDigital Library
- Min Si, Huansong Fu, Jeff R. Hammond, and Pavan Balaji. 2021. OpenSHMEM over MPI as a Performance Contender: Thorough Analysis and Optimizations. In Proceedings of the 8th Workshop on OpenSHMEM and Related Technologies (Virtual Event) (OpenSHMEM '21). 39--60.Google Scholar
- Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models. In Proceedings of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii, USA) (MICRO-48). 647--659.Google Scholar
- Kenjiro Taura, Jun Nakashima, Rio Yokota, and Naoya Maruyama. 2012. A Task Parallel Implementation of Fast Multipole Methods. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis-Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (Salt Lake City, Utah, USA) (ScalA' 12). 617--625.Google Scholar
- Keisuke Tsugane, Jinpil Lee, Hitoshi Murai, and Mitsuhisa Sato. 2018. Multi-Tasking Execution in PGAS Language XcalableMP and Communication Optimization on Many-Core Clusters. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (Chiyoda, Tokyo, Japan) (HPC Asia 2018). 75--85.Google ScholarDigital Library
- Rio Yokota and Lorena Barba. 2020. GitHub repository: exafmm/exafmm-beta. Retrieved 2022-11-30 from https://github.com/exafmm/exafmm-betaGoogle Scholar
- Rio Yokota, Lorena A. Barba, Tetsu Narumi, and Kenji Yasuoka. 2013. Petascale Turbulence Simulation Using a Highly Parallel Fast Multipole Method on GPUs. Computer Physics Communications 184, 3 (2013), 445--455.Google ScholarCross Ref
- Jin Zhang, Xiangyao Yu, Zhengwei Qi, and Haibing Guan. 2022. Falcon: A Timestamp-based Protocol to Maximize the Cache Efficiency in the Distributed Shared Memory. In Proceedings of the 36th IEEE International Parallel and Distributed Processing Symposium (Lyon, France) (IPDPS '22). 974--984.Google ScholarCross Ref
- Wei Zhang, Olivier Tardieu, David Grove, Benjamin Herta, Tomio Kamada, Vijay Saraswat, and Mikio Takeuchi. 2014. GLB: Lifeline-Based Global Load Balancing Library in X10. In Proceedings of the First Workshop on Parallel Programming for Analytics Applications (Orlando, Florida, USA) (PPAA '14). 31--40.Google ScholarDigital Library
- Zhang Zhang, Jeevan Savant, and Steven Seidel. 2006. A UPC Runtime System Based on MPI and POSIX Threads. In Proceedings of the 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (Montbeliard-Sochaux, France) (PDP '06). 195--202.Google ScholarDigital Library
- Yili Zheng, Amir Kamil, Michael B. Driscoll, Hongzhang Shan, and Katherine Yelick. 2014. UPC++: A PGAS Extension for C++. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (Phoenix, Arizona, USA) (IPDPS '14). 1105--1114.Google ScholarDigital Library
Index Terms
- Itoyori: Reconciling Global Address Space and Global Fork-Join Task Parallelism
Recommendations
Productivity and performance using partitioned global address space languages
PASCO '07: Proceedings of the 2007 international workshop on Parallel symbolic computationPartitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a ...
Helper locks for fork-join parallel programming
PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingHelper locks allow programs with large parallel critical sections, called parallel regions, to execute more efficiently by enlisting processors that might otherwise be waiting on the helper lock to aid in the execution of the parallel region. Suppose ...
Helper locks for fork-join parallel programming
PPoPP '10Helper locks allow programs with large parallel critical sections, called parallel regions, to execute more efficiently by enlisting processors that might otherwise be waiting on the helper lock to aid in the execution of the parallel region. Suppose ...
Comments