ABSTRACT
Xeon Phi, based on the Intel Many Integrated Core (MIC) architecture, packs up to 1 TFLOP of performance on a single chip while providing x86_64 compatibility. InfiniBand, meanwhile, is one of the most popular interconnects for supercomputing systems. The software stack on Xeon Phi allows processes to directly access the InfiniBand HCA on the node, providing a low-latency path for internode communication. However, limitations in state-of-the-art chipsets such as Sandy Bridge restrict the bandwidth available for these transfers. In this paper, we propose MVAPICH-PRISM, a novel proxy-based framework to optimize communication performance on such systems. We present several designs and evaluate them using micro-benchmarks and application kernels. Our designs improve internode latency between Xeon Phi processes by up to 65% and internode bandwidth by up to five times. They improve the performance of the MPI_Alltoall operation by up to 65% with 256 processes, and improve the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1,024 and 512 processes, respectively.
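The abstract only names the proxy-based approach; the following minimal C sketch illustrates the staging pattern such a proxy implies, assuming outgoing data is first moved from the coprocessor to host memory over SCIF and then forwarded over InfiniBand by a host-resident helper. The port number, chunk size, and forward_over_ib() stub are hypothetical placeholders, not MVAPICH-PRISM's actual interfaces.

```c
/*
 * Hypothetical host-side proxy loop: a Xeon Phi process pushes outgoing
 * message data to the host over SCIF, and the host forwards it over
 * InfiniBand. PROXY_PORT, CHUNK_SIZE, and forward_over_ib() are
 * illustrative names only.
 */
#include <scif.h>
#include <stdio.h>
#include <stdlib.h>

#define PROXY_PORT 2050       /* arbitrary SCIF port for the proxy */
#define CHUNK_SIZE (1 << 20)  /* 1 MiB staging chunks */

/* Stand-in for the InfiniBand leg (e.g., an ibv_post_send on a
 * pre-established queue pair); omitted to keep the sketch short. */
static void forward_over_ib(const void *buf, size_t len)
{
    printf("would forward %zu bytes over InfiniBand\n", len);
}

int main(void)
{
    scif_epd_t lep, cep;
    struct scif_portID peer;
    char *stage = malloc(CHUNK_SIZE);

    if (!stage)
        return 1;

    /* Listen for a connection from the Xeon Phi side of the proxy. */
    if ((lep = scif_open()) < 0)        { perror("scif_open");   return 1; }
    if (scif_bind(lep, PROXY_PORT) < 0) { perror("scif_bind");   return 1; }
    if (scif_listen(lep, 1) < 0)        { perror("scif_listen"); return 1; }
    if (scif_accept(lep, &peer, &cep, SCIF_ACCEPT_SYNC) < 0) {
        perror("scif_accept");
        return 1;
    }

    /* Stage data from the coprocessor into host memory, then forward it.
     * This sketch assumes the coprocessor side sends fixed CHUNK_SIZE
     * messages; the loop ends when the peer closes its endpoint. */
    for (;;) {
        int n = scif_recv(cep, stage, CHUNK_SIZE, SCIF_RECV_BLOCK);
        if (n <= 0)
            break;
        forward_over_ib(stage, (size_t)n);
    }

    scif_close(cep);
    scif_close(lep);
    free(stage);
    return 0;
}
```

In a real design the SCIF receive and the InfiniBand send would be pipelined across staging chunks so that the extra copy through host memory does not serialize the transfer; that overlap is what lets a proxy recover the bandwidth lost on the bandwidth-limited direct path.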