ABSTRACT
Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOP of performance on a single chip while providing x86_64 compatibility. InfiniBand, meanwhile, is one of the most popular interconnects for supercomputing systems. The software stack on Xeon Phi allows processes to directly access the InfiniBand HCA on the node, providing a low-latency path for internode communication. However, limitations in state-of-the-art chipsets such as Sandy Bridge restrict the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. They improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes, and improve the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.
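The abstract only names the proxy-based approach; the following minimal C sketch illustrates the staging pattern such a proxy implies, assuming outgoing data is first moved from the coprocessor to host memory over SCIF and then forwarded over InfiniBand by a host-resident helper. The port number, chunk size, and forward_over_ib() stub are hypothetical placeholders, not MVAPICH-PRISM's actual interfaces.

```c
/*
 * Hypothetical host-side proxy loop: a Xeon Phi process pushes outgoing
 * message data to the host over SCIF, and the host forwards it over
 * InfiniBand. PROXY_PORT, CHUNK_SIZE, and forward_over_ib() are
 * illustrative names only.
 */
#include <scif.h>
#include <stdio.h>
#include <stdlib.h>

#define PROXY_PORT 2050       /* arbitrary SCIF port for the proxy */
#define CHUNK_SIZE (1 << 20)  /* 1 MiB staging chunks */

/* Stand-in for the InfiniBand leg (e.g., an ibv_post_send on a
 * pre-established queue pair); omitted to keep the sketch short. */
static void forward_over_ib(const void *buf, size_t len)
{
    printf("would forward %zu bytes over InfiniBand\n", len);
}

int main(void)
{
    scif_epd_t lep, cep;
    struct scif_portID peer;
    char *stage = malloc(CHUNK_SIZE);

    if (!stage)
        return 1;

    /* Listen for a connection from the Xeon Phi side of the proxy. */
    if ((lep = scif_open()) < 0)        { perror("scif_open");   return 1; }
    if (scif_bind(lep, PROXY_PORT) < 0) { perror("scif_bind");   return 1; }
    if (scif_listen(lep, 1) < 0)        { perror("scif_listen"); return 1; }
    if (scif_accept(lep, &peer, &cep, SCIF_ACCEPT_SYNC) < 0) {
        perror("scif_accept");
        return 1;
    }

    /* Stage data from the coprocessor into host memory, then forward it.
     * This sketch assumes the coprocessor side sends fixed CHUNK_SIZE
     * messages; the loop ends when the peer closes its endpoint. */
    for (;;) {
        int n = scif_recv(cep, stage, CHUNK_SIZE, SCIF_RECV_BLOCK);
        if (n <= 0)
            break;
        forward_over_ib(stage, (size_t)n);
    }

    scif_close(cep);
    scif_close(lep);
    free(stage);
    return 0;
}
```

In a real design the SCIF receive and the InfiniBand send would be pipelined across staging chunks so that the extra copy through host memory does not serialize the transfer; that overlap is what lets a proxy recover the bandwidth lost on the bandwidth-limited direct path.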