Parallel Computing

Volume 61, January 2017, Pages 35-51

Data movement optimizations for independent MPI I/O on the Blue Gene/Q

https://doi.org/10.1016/j.parco.2016.07.002

Highlights

  • We propose an algorithm that routes data to closer I/O nodes and reduces network congestion.

  • We develop a route-aware strategy to modify the existing bridge node assignments on BG/Q.

  • Our algorithm reduces write times by 60% on average compared to default independent I/O.

  • The default independent MPI I/O routes 1.4× more I/O messages than our approach.

  • With intra-node aggregation, our modified independent I/O achieves 10.5 GB/s write bandwidth.

Abstract

Scalable high-performance I/O is crucial for application performance on large-scale systems. With the growing complexity of system interconnects, it has become important to consider the impact of network contention on I/O performance, because I/O messages traverse several hops in the interconnect before reaching the I/O nodes or the file system. In this work, we present a route-aware and load-aware algorithm that modifies the existing bridge node assignments in the Blue Gene/Q (BG/Q) supercomputer. We reduce network contention, cutting the write time by an average of 60% over the default independent I/O and by 20% over collective I/O on up to 8192 nodes of the Mira BG/Q system. Our algorithm routes 1.4× fewer messages through the bridge nodes, which connect to the I/O nodes on the BG/Q. It also reduces the average distance of a compute node from its bridge node, thus lessening the network load and decreasing I/O time.

Introduction

Parallel applications need to store persistent data for a myriad of purposes: checkpointing, visualization, analysis, etc. Therefore, the I/O subsystem is a crucial part of any cluster or supercomputer installation. Numerous efforts exist to optimize parallel I/O, such as data sieving and collective I/O optimizations [21], [22], [30], [33] and sophisticated distributed locking in parallel file systems such as GPFS [28]. I/O nodes or I/O servers have been introduced for I/O forwarding and I/O isolation [15], [17]. However, there has been less focus on the system interconnect through which the I/O messages are routed. With the advent of complex interconnects like the dragonfly and the high-dimensional torus [12], messages being written out first traverse the system interconnect to the I/O nodes en route to the file system; reads follow the reverse path. I/O messages are routed through multiple hops in the interconnect, which can cause congestion within the interconnect [15]. In this work, we advocate the importance of considering the interconnect for high-performance I/O: any I/O bottlenecks within these complex interconnects must be carefully analyzed in order to further enhance I/O performance.

Applications running on high-performance systems mainly use the optimized MPI-IO [30] for parallel reads and writes to the file system, or use high-level libraries such as HDF5 which in turn use MPI-IO. MPI-IO has two flavors: independent I/O and collective I/O. On some platforms, collective I/O outperforms independent I/O [21], [22] due to the several file-system-access optimizations in collective I/O. In some cases, independent I/O may perform better than collective I/O [24], [25]. Many applications and benchmarks, including S3D [13], FLASH [21], HACC [2], Energy2 [9], Turbulence3 [9], MADBench2 [8], and HOMME [8], have the option of using independent I/O [10]. The performance of independent vs. collective I/O for future burst buffers and NVRAM [16], [23] is not yet well studied.

In this work, we focus on independent I/O on Blue Gene/Q and show the efficacy of our algorithm over both collective and the default independent I/O. We used the Mira supercomputer at Argonne National Laboratory. Mira is an IBM Blue Gene/Q system with 786,432 cores (49,152 nodes) and a 5D torus interconnect. A subset of the compute nodes, called bridge nodes [12], [19], are the gateway nodes for I/O traffic. Fig. 1 shows a schematic of the BG/Q compute node and I/O node architecture. Each compute node has 10 links for communication within the torus, each with 2 GB/s bandwidth. Every compute node is assigned a bridge node. The bridge nodes connect to I/O nodes over an 11th serial I/O link at 2 GB/s, with two bridge nodes connecting to each I/O node. I/O nodes perform I/O on behalf of the compute nodes. Mira is configured with one I/O node per 128 compute nodes for I/O isolation [5].
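
The hop distances these assignments induce are easy to reason about with a small model. Below is a minimal Python sketch of hop counting on a 5D torus; the 4 × 4 × 4 × 4 × 2 extents are an assumed shape for a 512-node partition, not a figure from the paper.

    # Minimal sketch: hop distance between two coordinates on a 5D torus.
    # The A,B,C,D,E extents below are an assumed 512-node partition shape.

    TORUS_DIMS = (4, 4, 4, 4, 2)

    def torus_hops(src, dst, dims=TORUS_DIMS):
        """Hops between src and dst, taking the shorter direction
        around each torus ring (wraparound links)."""
        hops = 0
        for s, d, n in zip(src, dst, dims):
            delta = abs(s - d)
            hops += min(delta, n - delta)
        return hops

    # Example: 1 + 2 + 0 + 1 + 1 = 5 hops
    print(torus_hops((0, 0, 0, 0, 0), (3, 2, 0, 1, 1)))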

The bridge nodes are selected at partition boot time, depending on the partition in which the job is running. They are attached to an already booted I/O block [11], [17]. The deterministic (static) routing order and the torus connectivity of the partition determine the paths taken by I/O messages from a compute node to its bridge node [12]. The default paths taken by I/O messages in a machine like BG/Q, with its complex interconnect topology, are not guaranteed to have minimum congestion. We examine the route taken by I/O messages from a compute node to its bridge node for independent I/O using the BG/Q routing algorithm. Fig. 2 shows the I/O paths from compute nodes assigned bridge node 98 on Mira, for a 512-node job with 1 rank per node. The edge width indicates the number of messages routed through that edge. Some paths are clearly far more heavily loaded than others, which leads to congestion and hence poor I/O performance. The load is higher still for jobs with multiple ranks per node, because multiple messages from a node traverse the same network links from a source to its destination due to static routing. In the default case, a bridge node may even lie on the path from a compute node to its assigned bridge node, because assignments are not always to the closest bridge nodes. This results in more congested links.
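
The link-load view behind Fig. 2 can be approximated by replaying each I/O message's deterministic route and counting messages per directed link. The sketch below assumes a fixed dimension-ordered routing (correct A, then B, and so on); BG/Q's actual static routing order is configured by the system, so this models the idea rather than the machine's exact algorithm.

    # Sketch: count I/O messages per directed link under an assumed
    # dimension-ordered static routing.

    from collections import Counter

    TORUS_DIMS = (4, 4, 4, 4, 2)  # assumed 512-node partition shape

    def static_route(src, dst, dims=TORUS_DIMS):
        """Yield the directed links (hop_from, hop_to) of a default path,
        correcting one dimension at a time in a fixed A,B,C,D,E order."""
        cur = list(src)
        for axis, n in enumerate(dims):
            while cur[axis] != dst[axis]:
                fwd = (dst[axis] - cur[axis]) % n
                step = 1 if fwd <= n - fwd else -1  # shorter ring direction
                nxt = cur[:]
                nxt[axis] = (cur[axis] + step) % n
                yield tuple(cur), tuple(nxt)
                cur = nxt

    def link_loads(assignments):
        """assignments: {compute_node_coords: bridge_node_coords}."""
        loads = Counter()
        for node, bridge in assignments.items():
            for link in static_route(node, bridge):
                loads[link] += 1
        return loads  # loads.most_common() exposes the hottest links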

In this paper, we present an algorithm for optimizing independent MPI-IO by reducing network congestion. The default algorithm for determining a bridge node for a compute node assigns an equal number of compute nodes to each bridge node. Though this results in the same number of I/O requests per bridge node, the physical network links carry an uneven load (packets traveling on a link) due to deterministic routing. Furthermore, for some compute nodes, the distance (hops) to their respective bridge nodes is large due to inefficient bridge node assignments. This also produces a few highly congested links, with multiple messages routed on a link, as shown in Fig. 2. This increases the time taken by I/O messages to reach the bridge nodes and hence creates an I/O bottleneck. Though BG/Q has a low-latency, high-bandwidth network, longer hops on congested links to the bridge nodes result in latency-bound I/O in some cases. We have developed an algorithm that reduces the average distance of a compute node from a bridge node, and thus lessens the network load and decreases I/O time.

Our approach is to determine a different bridge node for some of the compute nodes. The objective is to find a bridge node that is closer to a compute node than its default bridge node. The closer a bridge node is to a compute node, the shorter the average distance traveled by I/O messages and the lower the overall network contention. We solve this problem by building a connectivity tree for each bridge node and then traversing each tree using global knowledge from all other trees. We not only assign a closer bridge node but also ensure that the number of new assignments per bridge node is similar, so that the load (number of I/O requests) remains balanced across all bridge nodes, as in the default case.
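
As a rough illustration of the balance constraint, the greedy sketch below assigns each compute node to a nearby bridge node subject to a per-bridge quota, placing the hardest-to-serve nodes first. It is a simplification under stated assumptions: our actual algorithm operates on per-bridge connectivity trees with global knowledge (described next), not on raw distances alone.

    # Sketch: distance-aware, load-capped bridge node reassignment.

    from math import ceil

    def reassign(nodes, bridges, dist, slack=1.0):
        """Assign each compute node a nearby bridge node while capping
        the number of nodes per bridge. `dist(a, b)` returns a hop count
        (e.g., the torus_hops() sketch above); coordinates are tuples."""
        cap = ceil(len(nodes) / len(bridges) * slack)  # per-bridge quota
        load = {b: 0 for b in bridges}
        assignment = {}
        # Place the hardest-to-serve nodes first: those whose nearest
        # bridge node is farthest away.
        order = sorted(nodes, key=lambda v: min(dist(v, b) for b in bridges),
                       reverse=True)
        for node in order:
            for b in sorted(bridges, key=lambda b: dist(node, b)):
                if load[b] < cap:  # closest bridge with spare quota
                    assignment[node] = b
                    load[b] += 1
                    break
        return assignment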

The unbalanced network load results in large variation in the I/O completion times per process. The time taken by 4096 processes to write a 4 GB file to the I/O nodes varies between 0.13 and 0.72 s across the different processes using independent MPI-IO. Measuring the time to write to the I/O nodes helps isolate interconnect-related bottlenecks from file-system-related performance bottlenecks. This large variation is due not only to longer hops but also to the routing order, which results in many I/O messages routed along the same links. In our algorithm, we reduce this congestion by considering the network route taken by I/O messages while determining the new assignments. The network tree for a bridge node is built such that the path from any node in the tree to the bridge node is its default path, i.e., the path determined by the static routing order of BG/Q. Also, the paths in a tree are distinct from the paths in other trees. Thus, when we assign a new bridge node to the nodes in its tree, the paths of I/O messages from these nodes to the bridge node are distinct from the paths taken by nodes in other trees corresponding to other bridge nodes. Hence, there is less interference on a path in a tree from traffic to other bridge nodes, and less overlap of messages on the same links for the new assignments.
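
The tree construction itself can be sketched as follows: given each member node's default path (e.g., from a routing replay like the one above), the union of those paths forms a tree rooted at the bridge node, because deterministic routing gives every node a unique next hop toward a given destination. This is an illustrative sketch, not the production implementation.

    # Sketch: build one bridge node's connectivity tree from default paths.

    def build_bridge_tree(bridge, default_paths):
        """Union the default paths into a child -> parent map rooted at
        `bridge`. Each path is a list of node coordinates ending at the
        bridge node (its default static route)."""
        parent = {bridge: None}  # root of the tree
        for path in default_paths:
            for frm, to in zip(path, path[1:]):
                parent.setdefault(frm, to)  # routes to one destination agree
        return parent

    def hops_to_bridge(parent, node):
        """Tree depth of `node`: its hop count to the bridge node."""
        d = 0
        while parent[node] is not None:
            node, d = parent[node], d + 1
        return d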

The overhead in our approach is the explicit send/receive of I/O data, because the compute nodes are assigned a bridge node at boot time, which cannot be changed at runtime from a user-space application. Despite this, our load-aware and route-aware algorithm improves independent I/O time by mitigating the bottlenecks in the network. We combine messages from multiple ranks in a node and send them to the bridge nodes. Using our algorithm, with and without this coalescing technique, we not only improve write times to the file system over the default independent I/O, but also improve the write time to the bridge nodes over both independent and collective I/O. This will be very useful in future systems equipped with burst buffers, which may be connected to a few gateway nodes in the system [23], similar to the current I/O node architecture.
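
The coalescing step can be sketched in a few lines of MPI. The snippet below uses mpi4py purely for brevity (our implementation is not in Python): ranks sharing a physical node gather their buffers at a leader rank, which sends one combined message toward the assigned bridge node. The helper bridge_rank_of() is a hypothetical stand-in for the route-aware assignment.

    # Sketch: intra-node aggregation before sending to the bridge node.

    from mpi4py import MPI

    def aggregated_send(my_bytes, bridge_rank_of):
        """Gather this node's buffers at a leader rank and send one
        coalesced message to a rank on the assigned bridge node.
        bridge_rank_of() is a hypothetical assignment lookup."""
        world = MPI.COMM_WORLD
        # Communicator spanning only the ranks on this compute node.
        node = world.Split_type(MPI.COMM_TYPE_SHARED)
        chunks = node.gather(my_bytes, root=0)  # intra-node aggregation
        if node.rank == 0:
            payload = b"".join(chunks)          # one message per node
            world.Send([payload, MPI.BYTE],
                       dest=bridge_rank_of(world.rank), tag=7)
            # A matching world.Recv on the bridge-node rank would then
            # perform the actual write on behalf of this node.
        node.Free()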

We study the impact of our algorithm on smaller and larger writes, ranging from 8 KB to 4 MB per MPI rank using independent MPI-IO. It is known that smaller independent writes perform worse [10]. However, our results show an average improvement of 60% over the default independent MPI-IO on 512 to 131,072 processes on Mira. Our optimized independent I/O also improves performance by an average of 20% over collective I/O while writing to the I/O nodes, for data sizes ranging from 128 KB to 8 MB per MPI rank on 8192 to 131,072 MPI processes. Our results also show that the default independent MPI-IO on Mira routes 1.4× more I/O messages through the bridge nodes as a result of inefficient bridge node assignments. With intra-node aggregation, we achieved 10.5 GB/s write bandwidth on 4096 nodes writing a 64 GB file, compared to 0.7 GB/s with the default MPI independent I/O.

The main contributions of our work are summarized below:

  • We present an algorithm that finds better bridge node assignments for compute nodes, with improved routing and shorter hops to the bridge nodes.

  • We build a connectivity tree for each bridge node and confine the I/O traffic destined for that bridge node to its tree edges, reducing network congestion.

  • We aggregate data at one or more cores on a node before sending it to the respective bridge node cores; this reduces network congestion and coalesces accesses.

  • Our algorithm routes I/O messages more efficiently than the default algorithm on BG/Q, giving 60% better performance on average than the default MPI independent I/O.

Section snippets

Related work

There has been a lot of work on I/O optimization, most of the efforts being for collective writes [14], [30], [32]. Chen et al. [14] propose physical data-layout-aware rearranging and reordering of MPI-IO accesses. Wang et al. [32] propose reorganizing I/O requests within each file domain to reduce contention at the file system. These efforts optimize accesses to file systems, which are optimized for aligned, large-chunk, block-level accesses [28], [29]. In this work, we focus on the path of I/O…

Route-aware and load-aware algorithm for independent I/O

In this section, we present our algorithm for optimized independent MPI I/O. In the default case of independent MPI I/O on Blue Gene/Q, all the MPI processes running on the compute nodes write their data through the default bridge nodes (a small subset of the compute nodes). Every compute node has a statically assigned default bridge node for the duration of the job runtime. The bridge nodes are assigned at boot time of the compute node partition(s). These bridge nodes are connected to the I/O…

Results and discussion

We performed all our experiments on Mira, the BG/Q system at Argonne National Laboratory. Mira is a 10 PF, 48-rack machine. Each compute node has 16 cores and 16 GB RAM. Mira has a GPFS file system with 24 PB of capacity and a peak I/O bandwidth of 240 GB/s. Compute nodes are connected to 384 I/O nodes, which connect to the GPFS file system over a QDR InfiniBand network. We used 512 to 8192 nodes for our experiments and submitted our jobs to the default queue. The batch scheduler on Mira selects the…

Conclusions

Improved I/O performance results in improved performance of I/O-intensive applications. In this work, we have shown that better routing in the interconnect affects I/O performance. We improved independent I/O performance on Blue Gene/Q by modifying the bridge node assignments in a route-aware manner. We show that it is important to consider the route of I/O messages within the interconnect before they reach the I/O nodes or the file system. We demonstrate at scale that by mitigating the…

Acknowledgments

This research has been funded in part by and used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. This work was supported in part by the DOE Office of Science, Advanced Scientific Computing Research, under award numbers 57L38, 57L32, 57K07, and 57K50. The authors would also like to thank Philip Heidelberger and Adam Scovel for their help with BG/Q…

References (33)

  • Aurora,...
  • CORAL benchmarks,...
  • Cori,...
  • Graphviz,...
  • Mira,...
  • P. Balaji et al., Mapping communication layouts to network hardware characteristics on massive-scale Blue Gene systems, Comput. Sci. Res. Dev. (2011)
  • A. Bhatele et al., There Goes the Neighborhood: Performance Degradation Due to Nearby Jobs, Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis (2013)
  • P.H. Carns et al., 24/7 Characterization of Petascale I/O Workloads, Proc. of the First Workshop on Interfaces and Abstractions for Scientific Data Storage (2009)
  • P. Carns et al., Understanding and improving computational science storage access through continuous characterization, Trans. Storage (2011)
  • P. Carns et al., Production I/O Characterization on the Cray XE6, CUG 2013 (2013)
  • D. Chen et al., The IBM Blue Gene/Q Interconnection Network and Message Unit, Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis (2011)
  • D. Chen et al., Looking Under the Hood of the IBM Blue Gene/Q Network, Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis (2012)
  • J.H. Chen et al., Terascale direct numerical simulations of turbulent combustion using S3D, Comput. Sci. Discovery (2009)
  • Y. Chen et al., LACIO: A New Collective I/O Strategy for Parallel I/O Systems, IEEE International Parallel & Distributed Processing Symposium (IPDPS) (2011)
  • D.A. Dillow et al., Enhancing I/O Throughput via Efficient Routing and Placement for Large-scale Parallel File Systems, Proc. of the 30th IEEE International Performance Computing and Communications Conference (2011)
  • S. El Sayed et al., Using GPFS to Manage NVRAM-Based Storage Cache
