ABSTRACT
The steadily increasing number of nodes in high-performance computing systems, together with technology and power constraints, leads to sparse network topologies. Efficient mapping of application communication patterns to the network topology gains importance as systems grow to petascale and beyond. Such mapping is supported in parallel programming frameworks such as MPI, but is often not well implemented. We show that the topology mapping problem is NP-complete, and we analyze and compare different practical topology mapping heuristics. We demonstrate a fast new heuristic based on graph similarity and show its utility with application communication patterns on real topologies. Our mapping strategies support heterogeneous networks and significantly reduce congestion on torus, fat-tree, and PERCS network topologies for irregular communication patterns. We also demonstrate that the benefit of topology mapping grows with the network size and show how our algorithms can be used in a practical setting to optimize communication performance. Our topology mapping strategies reduce network congestion by up to 80%, reduce average dilation by up to 50%, and improve benchmarked communication performance by 18%.
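The two quality metrics named above, dilation and congestion, can be illustrated with a small sketch. It is not the heuristic from this work: it only evaluates a given mapping, and it assumes a simplified setting (a one-dimensional ring as the network, shortest-path routing, unit link capacity, unweighted averaging of dilation). The names ring_route and evaluate_mapping are hypothetical helpers introduced for illustration.

```python
from collections import Counter


def ring_route(src, dst, n):
    """Directed links on a shortest path from node src to node dst on an n-node ring."""
    forward = (dst - src) % n
    # Assumption: ties between the two directions are broken towards "forward".
    step = 1 if forward <= n - forward else -1
    links, node = [], src
    while node != dst:
        nxt = (node + step) % n
        links.append((node, nxt))
        node = nxt
    return links


def evaluate_mapping(comm_edges, mapping, n):
    """Return (average dilation, maximum link congestion) of a mapping.

    comm_edges: list of (src_process, dst_process, volume) tuples.
    mapping:    mapping[p] is the ring node that hosts process p.
    Dilation of an edge is its hop count; congestion of a link is the
    total volume routed over it (unit capacity assumed).
    """
    load = Counter()
    total_hops = 0
    for src, dst, volume in comm_edges:
        path = ring_route(mapping[src], mapping[dst], n)
        total_hops += len(path)
        for link in path:
            load[link] += volume
    avg_dilation = total_hops / len(comm_edges)
    max_congestion = max(load.values(), default=0)
    return avg_dilation, max_congestion


# Four processes that communicate in a ring pattern, placed on a 4-node ring.
comm_edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
identity = [0, 1, 2, 3]   # neighbours in the pattern are neighbours on the ring
scrambled = [0, 2, 1, 3]  # processes 1 and 2 swapped

print(evaluate_mapping(comm_edges, identity, 4))   # (1.0, 1)
print(evaluate_mapping(comm_edges, scrambled, 4))  # (1.5, 2)
```

In this toy case the scrambled placement raises average dilation from 1.0 to 1.5 and doubles the load on the busiest link, which is the kind of gap a mapping heuristic tries to close.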