ABSTRACT
Co-array Fortran (CAF) and Unified Parallel C (UPC) are two emerging languages for single-program, multiple-data global address space programming. These languages boost programmer productivity by providing shared variables for inter-process communication instead of message passing. However, their performance still has room for improvement. In this paper, we study the performance of variants of the NAS MG, CG, SP, and BT benchmarks on several modern architectures to identify challenges that must be met to deliver top performance. We compare CAF and UPC variants of these programs with the original Fortran+MPI code. Today, CAF and UPC programs deliver scalable performance on clusters only when written to use bulk communication. Our experiments also uncovered significant performance bottlenecks in UPC codes on all platforms. We account for the root causes limiting UPC performance, such as the synchronization model, the communication efficiency of strided data, and source-to-source translation issues. We show that they can be remedied with language extensions, new synchronization constructs, and, finally, adequate optimizations by the back-end C compilers.
Index Terms
- An evaluation of global address space languages: Co-array Fortran and Unified Parallel C