IMPI: Making MPI Interoperable
Journal of Research of the National Institute of Standards and Technology, Volume 105, Number 3, May-June 2000

The Message Passing Interface (MPI) is the de facto standard for writing parallel scientific applications in the message passing programming paradigm. Implementations of MPI were not designed to interoperate, thereby limiting the environments in which parallel jobs could be run. We briefly describe a set of protocols, designed by a steering committee of current implementors of MPI, that enable two or more implementations of MPI to interoperate within a single application. Specifically, we introduce the set of protocols collectively called Interoperable MPI (IMPI). These protocols make use of novel techniques to handle difficult requirements such as maintaining interoperability among all IMPI implementations while also allowing for the independent evolution of the collective communication algorithms used in IMPI. Our contribution to this effort has been threefold: facilitating the meetings, editing the IMPI Specification document, and providing an early testbed for implementations of IMPI. This testbed is in the form of an IMPI conformance tester, a system that can verify the correct operation of an IMPI-enabled version of MPI.


Introduction
The Message Passing Interface (MPI) [6,7] is the de facto standard for writing scientific applications in the message passing programming paradigm. MPI was first defined in 1993 by the MPI Forum (http://www.mpi-forum.org), comprised of representatives from United States and international industry, academia, and government laboratories. The protocol introduced here, the Interoperable MPI (IMPI) protocol, extends the power of MPI by allowing applications to run on heterogeneous clusters of machines with various architectures and operating systems, each of which in turn can be a parallel machine, while allowing the program to use a different implementation of MPI on each machine. This is accomplished without requiring any modifications to the existing MPI specification. That is, IMPI does not add, remove, or modify the semantics of any of the existing MPI routines. All current valid MPI programs can be run in this way without any changes to their source code.
The purpose of this paper is to introduce IMPI, indicate some of the novel techniques used to make IMPI work as intended, and describe the role NIST has played in its development and testing. As of this writing, there is one MPI implementation, Local Area Multicomputer (LAM) [19], that supports IMPI, but others have indicated their intent to implement IMPI once the first version of the protocol has been completed. A more detailed explanation of the motivation for and design of IMPI is given in the first chapter of the IMPI Specification document [3], which is included in its entirety as an appendix to this paper.
The need for interoperable MPI is driven by the desire to make use of more than one machine to run applications, either to lower the computation time or to enable the solution of problems that are too large for any available single machine. Another anticipated use for IMPI is for computational steering in which one or more processes, possibly running on a machine designed for high-speed visualization, are used interactively to control the raw computation that is occurring on one or more other machines.
Although current portable implementations of MPI, such as MPICH [14] (from the MPICH documentation: The "CH" in MPICH stands for "Chameleon," symbol of adaptability to one's environment and thus of portability.) and LAM (Local Area Multicomputer) [2], support heterogeneous clusters of machines, this approach does not allow the use of vendor-tuned MPI libraries and can therefore sacrifice communications performance. There are several other related projects. PVMPI [4] (a combination of the acronyms PVM, which stands for Parallel Virtual Machine, another message passing system, and MPI) and its successor MPI-Connect [5] use the native MPI implementation on each system, but use some other communication channel, such as PVM, when passing messages between processes in the different systems. One main difference between the PVMPI/MPI-Connect interoperable MPI systems and IMPI is that no collective communication operations, such as broadcasting a value from one process to all of the other processes (MPI_Bcast) or synchronizing all of the processes (MPI_Barrier), are supported between MPI implementations. MAGPIE [16,17] is a library of collective communications operations built on top of MPI (using MPICH) and optimized for wide area networks. Although this system allows for collective communications across all MPI processes, MPICH must be used on all of the machines rather than the vendor-tuned MPI libraries. Finally, a version of MPICH developed in conjunction with the Globus project [9] operates over a wide area network; it, too, bypasses the vendor-tuned MPI libraries.
Several ongoing research projects take the concept of running parallel applications on multiple machines much further. The concept variously known as metacomputing, wide area computing, computational grids, or the IPG (Information Power Grid) is being pursued as a viable computational framework in which a program is submitted to run on a geographically distributed group of Internet-connected sites. These sites form a grid which provides all of the resources, including multiprocessor machines, needed to run large jobs. The many and varied protocols and infrastructures needed to realize this vision are an active research topic [9,10,11,12,13,15]. Some of the problems under study include computational models, resource allocation, user authentication, resource reservations, and security. A related project at NIST is WebSubmit [18], a web-based user interface that handles user authentication and provides a single point of contact for users to submit and manage long-running jobs on any of our high-performance and parallel machines.

The IMPI Steering Committee Meetings
The Interoperable MPI steering committee first met in March 1997 to begin work on specifying the Interoperable MPI protocol. This first meeting was organized and hosted by NIST at the request of the attending vendor and academic representatives. All of these initial members (with one neutral exception) expressed the view that the role of NIST in this process would be vital. As a knowledgeable neutral party, NIST would help facilitate the process and provide a testbed for implementations. At this first meeting, only representatives from within the United States attended, but the question of allowing international vendors to participate was introduced. This was later agreed to and several foreign vendors actively participated in the IMPI meetings. All participating vendors are listed in the IMPI document (see Appendix).
There were eight formal meetings of the IMPI steering committee from March 1997 to March 1999, augmented with a NIST-maintained mailing list for ongoing discussions between meetings.
NIST has had three main roles in this effort: facilitating the meetings and maintaining an on-line mailing list, editing the IMPI protocol document, and conformance testing. It is this last task, conformance testing, that required our greatest effort.

Design Highlights of the IMPI Protocols
The IMPI protocols were designed with several important guiding principles. First, IMPI was not to alter the MPI interface. That is, no user-level MPI routines were to be added and no changes were to be made to the interfaces of the existing routines. Any valid MPI program must run correctly using IMPI if it runs correctly without IMPI. Second, the performance of communication within an MPI implementation should not be noticeably impacted by supporting IMPI. IMPI should only have a noticeable impact on communication performance when a message is passed between two MPI implementations (the success of this goal will not be known until implementations are completed). Finally, IMPI was designed to allow for the easy evolution of its protocols, especially its collective communications algorithms. It is this last goal that is most important for the long-term usefulness of IMPI for MPI users.
An IMPI job, once running, consists of a set of MPI processes that are running under the control of two or more instances of MPI libraries. These MPI processes are typically running on two or more systems. A system, for this discussion, is a machine, with one or more processors, that supports MPI programs running under control of a single instance of an MPI library. Note that under these definitions, it is not necessary to have two different implementations of MPI in order to make good use of IMPI. In fact, given two identical multiprocessor machines that are only linked via a LAN (Local Area Network), it is possible that the vendor supplied MPI library will not allow you to run a single MPI job across all of the processors of both machines. In this case, IMPI would add that capability, even though you are running on one architecture and using one implementation of MPI.
The remainder of this section outlines some of the more important design decisions made in the development of IMPI. This is a high-level discussion of a few important aspects of IMPI with many details omitted for brevity.

Common Communication Protocol
As few assumptions as possible were made about the systems on which IMPI jobs would be run; however, some common attributes were assumed as a starting point for interoperability.
The most basic assumption made, after some debate, was that TCP/IP would be the underlying communications protocol between IMPI implementations. TCP/IP (Transmission Control Protocol/Internet Protocol), is one of the basic communications protocols used over the Internet. It is important to note that this decision does not mandate that all machines running the MPI processes be capable of communicating over a TCP/IP channel, only that they can communicate, directly or indirectly, with a machine that can. IMPI does not require a completely connected set of MPI processes. In fact, only a small number of communications channels are used to connect the MPI processes on the participating systems.
The decision to use only a few communications channels to connect the systems in an IMPI job, rather than requiring a more dense connection topology, was made under the assumption that these IMPI communications channels would be slower, in some cases many times slower, than the networks connecting the processors within each of the systems. Even as the performance of networking technology increases, it is likely that the speed of the dedicated internal system networks will always meet or exceed the external network speed.
Other communications media, besides TCP/IP, could be added to IMPI as needed, for example to support IMPI between embedded devices. However, the use of TCP/IP was considered the natural choice for most computing sites.

Start-up
One of the first challenges faced in the design of IMPI was determining how to start an IMPI job. The main task of the IMPI start-up protocol is to establish communication channels between the MPI processes running on the different systems.
Initially, several procedures for starting an IMPI job were proposed. After several iterations a very simple and flexible system was designed. A single, implementation-independent process, the IMPI server, is used as a rendezvous point for all participating systems. This process can be run anywhere that is network-reachable by all of the participating systems, which includes any of the participating systems or any other suitable machine. Since this server utilizes no architecture-specific information, a portable implementation can be shared by all users. As a service to the other MPI implementors, the Laboratory for Computer Science at the University of Notre Dame (the current developers of LAM/MPI) has provided a portable IMPI server that all vendors can use. The IMPI server is not only implementation independent, it is also immune to most changes to IMPI itself. The server is a simple rendezvous point that knows nothing of the information it is receiving; it simply relays the information it receives to all of the participating systems. All of the negotiations that take place during the start-up are handled within the individual IMPI/MPI implementations. The only information that the server needs at start-up is how many systems will be participating.
One of the first things the IMPI server does is print out a string containing enough information for any of the participating systems to be able to contact it. This string contains the Internet address of the machine running the IMPI server and the TCP/IP port that the server is listening on for connections from the participating systems.
The conversation that takes place between the participating systems, relayed through the IMPI server, is in a simple "tokenized" language in which each token identifies a certain piece of information needed to configure the connections between the systems. For example, one particular token exchanged between all systems indicates the maximum number of bytes each system is willing to send or receive in a single message over the IMPI channels. Messages larger than this size must be divided into multiple packets, each of which is no larger than this maximum size. Once this token is exchanged, all systems choose the smallest of the values as the maximum message size.
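The outcome of this particular negotiation can be sketched in C. The function and parameter names here are this paper's illustration, not part of the protocol or its wire format:

```c
#include <stdint.h>

/* Each system submits the largest packet it is willing to send or
 * receive over the IMPI channels; the job-wide maximum becomes the
 * smallest of the submitted values, so that every system can accept
 * every packet it is sent. */
int32_t negotiate_max_packet(const int32_t submitted[], int nsystems)
{
    int32_t max_packet = submitted[0];
    for (int i = 1; i < nsystems; i++)
        if (submitted[i] < max_packet)
            max_packet = submitted[i];
    return max_packet;
}
```

A message of n bytes is then carried in ceiling(n / max_packet) packets over the IMPI channels.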
Many tokens are specified in the IMPI protocol, and all systems must submit values for each of these tokens. However, any system is free to introduce new tokens at any time. A system that does not recognize a token it receives during start-up can simply ignore it. This is a powerful capability that requires no changes to either the IMPI server or to the current IMPI specification. This allows for experimentation with IMPI without requiring the active participation of other IMPI/MPI implementors. Once support for IMPI version 0.0 has been added to them, any of the freely available implementations of MPI, such as MPICH or LAM, can be used by anyone interested in experimenting with IMPI at this level. If a new start-up parameter appears to be useful, then it can be added to an IMPI implementation and be used as if it were part of the original IMPI protocol.
One particular parameter, the IMPI version number, is intended for indicating updates to one or more internal protocols or to indicate the support for a new set of collective communications algorithms. For example, if one or more new collective algorithms have been shown to enhance the performance of IMPI, then support for those new algorithms by a system would be indicated by passing in the appropriate IMPI version number during IMPI start-up. All systems must support IMPI version 0.0 level protocols and collective communications algorithms, but may also support any number of higher level sets of algorithms. This is somewhat different from traditional version numbering in that an IMPI implementation must indicate not only its latest version, but all of the previous versions that it currently supports (which must always include 0.0). Since all systems must agree on the collective algorithms to be used, the IMPI version numbers are compared at start-up and the highest version supported by all systems will be used. It is possible for an IMPI implementation to allow the user to control this negotiation partially by allowing the user to specify a particular IMPI version number (as a command-line option perhaps). The decision to provide this level of flexibility to the user is completely up to those implementing IMPI.
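The version negotiation amounts to a set intersection. Representing each system's supported versions as a bit mask (bit 0 standing for the mandatory version 0.0) is this sketch's assumption, not the wire format:

```c
#include <stdint.h>

/* Every system must support version 0.0 (bit 0), so the intersection
 * of the masks is never empty.  The job runs the highest version whose
 * bit is set in every system's mask. */
int negotiate_version(const uint32_t masks[], int nsystems)
{
    uint32_t common = masks[0];
    for (int i = 1; i < nsystems; i++)
        common &= masks[i];

    int best = 0;
    for (int bit = 0; bit < 32; bit++)
        if (common & (1u << bit))
            best = bit;     /* remember the highest common version */
    return best;
}
```

Note that this matches the text above: an implementation advertises all versions it supports, not just its latest, which is what makes the intersection meaningful.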

Security
As an integral part of the IMPI start-up protocol, the IMPI server accepts connections from the participating systems. In the time interval between the starting of the IMPI server and the connection of the last participating system to the server, there is the possibility that some other rogue process might try to contact the server. Therefore, it is important for the IMPI server to authenticate the connections it accepts. This is especially true when connecting systems that are either geographically distant or not protected by other security means such as a network firewall. The initial IMPI protocol allows for authentication via a simple 64-bit key chosen by the user at start-up time. Much more sophisticated authentication systems are anticipated, so IMPI includes a flexible security system that supports multiple authentication protocols in a manner similar to the support for multiple IMPI versions. Each IMPI implementation must support at least the simple 64-bit key authentication, but can also support any number of other authentication schemes.
Just as the collective communications algorithms that are to be used can be partially controlled by the user via command-line options, the authentication protocol can also be chosen by the user. More details of this are given in Sec. 2.3.3 of the IMPI Specification document.
If security on the IMPI communication channels during program execution is needed, that is, between MPI processes, then updating IMPI to operate over secure sockets could be considered. Support for this option in an IMPI implementation could be indicated during IMPI start-up.

Topology Discovery
The topology of the network connecting the IMPI systems, that is, the set of network connections available between the systems, can have a dramatic effect on the performance of the collective communications algorithms used. It is not likely that any static collective algorithm will be optimal in all cases. Rather, these collective algorithms will need to dynamically choose an algorithm to use based on the available network. The initial IMPI collective algorithms acknowledge this in that, in many cases, they choose between two algorithms based on the size of the messages involved and the number of systems involved. Algorithms for large messages try to minimize the amount of data transmitted (data is not transmitted more than once if possible), and algorithms for small messages try to minimize latency by parallelizing the communication where possible.

Conformance Tester
The design of the IMPI tester, which we will refer to simply as the tester, is unique in that it is accessed over the Web and operates completely over the Internet. This design for a tester has many advantages over the conventional practice of providing conformance testing in the form of one or more portable programs delivered to the implementor's site and compiled and run on their system. For example, the majority of the IMPI tester code runs exclusively on a host machine at NIST, regardless of who is using the tester, thus eliminating the need to port this code to multiple platforms, the need for documents instructing the users how to install and compile the system, and the need to inform users of updates to the tester (since NIST maintains the only instance of this part of the tester). There are two components of the tester that run at the user's site. The first of these components is a small Java applet that is downloaded on demand each time the tester is used, so this part of the tester is always up to date. Since it is written in Java and runs in a JVM (Java Virtual Machine), there is no need to port this code either. The other part of the tester that runs at the user's site is a test interpreter (a C/MPI program) that exercises the MPI implementation to be tested. This program is compiled and linked to the vendor's IMPI/MPI library. Since this C/MPI program is a test interpreter and not a collection of tests, it will not be frequently updated. This means that it will most likely need to be downloaded only once by a user. All updates, corrections, and additions to the conformance test suite will take place only at NIST. This design was inspired by the work of Brady and St. Pierre at NIST and their use of Java and CORBA in their conformance testing system [1]. In their system, CORBA was used as the communication interface between the tests and the objects under test (objects defined in IDL).
In our IMPI tester, since we are testing a TCP/IP-based communications protocol, we used the Java networking packages for all communications.

Enhancements to IMPI
This initial release of the IMPI protocol will enable users to spread their computations over multiple machines while still using highly-tuned native MPI implementations. This is a needed enhancement to MPI and will be useful in many settings, such as within the computing facilities at NIST. However, several enhancements to this initial version of IMPI are envisioned.
First, the IMPI collective communications algorithms will benefit from the ongoing Grid/IPG research on efficient collective algorithms for clusters and WANs [12,16,17,20]. IMPI has been designed to allow for experimenting with improved algorithms by allowing the participating MPI implementations to negotiate, at program start-up, which version of collective communications algorithms will be used. Second, although IMPI is currently defined to operate over TCP/IP sockets, a more secure version could be defined to operate over a secure channel such as SSL (Secure Socket Layer). Third, start-up of an IMPI job currently requires that multiple steps be taken by the user. This start-up process could be automated in order to simplify the starting and stopping of IMPI jobs.
IMPI-enabled clusters could be used in a WAN (Wide Area Network) environment using Globus [9], for example, for resource management, user authentication, and other management tasks needed when operating over large distances and between separately managed computing facilities. If two or more locally managed clusters can be used via IMPI to run a single job, then these clusters could be described as a single resource in a grid computation so that they can be offered and reserved as a unit in the grid.

Introduction to IMPI

Overview
The message passing community has long experience in harnessing heterogeneous computing resources into one parallel message passing computation. This is useful for a variety of applications: some "embarrassingly parallel" applications may be able to utilize spare compute power in a large network of workstations; some applications may decompose naturally into components that are better suited to different platforms, e.g., a simulation component and a visualization component; other applications may be too large to fit in one system. Such applications can be developed using standard interprocess communication protocols, such as sockets on TCP/IP. However, these protocols are at a lower level than the message passing interfaces defined by MPI [1]. Furthermore, if each subsystem is a parallel system, then MPI is likely to be used for "intra-system" communication, in order to achieve the better performance that vendor MPI libraries provide, as compared to TCP/IP. It is then convenient to use MPI for "inter-system" communication as well.
MPI was designed with such heterogeneous applications in mind. For example, all message passing communication is typed, so that it is possible to perform data conversion when data is transferred across systems with different data representations. Indeed, there are several freely available implementations of MPI that run in a heterogeneous environment. These implementations use a common approach. An infrastructure is developed that provides a parallel virtual machine, on top of the multiple heterogeneous systems. Then, message passing is implemented on this parallel virtual machine. This approach has several deficiencies:
- The parallel virtual machine has to be implemented and supported on each underlying platform by a third-party software developer. This poses a significant development and testing problem for such a developer, especially if it attempts to use faster but nonstandard interfaces for intra-system communication. So far, only academic development groups that have direct access to multiple platforms in supercomputing centers have been able to undertake such a development; it is hard to see a successful business model for such a product. In any case, this model implies that support for heterogeneous MPI always lags platform availability.
- Even though each system is likely to provide a native implementation of MPI for intra-system communication, the parallel virtual machine imposes an additional software layer, often resulting in reduced performance, even for MPI intra-system communication.
The MPI standard does not specify the interaction between MPI communication and TCP/IP communication; more generally, it does not specify the interaction between the MPI implementation and other communication mechanisms available on the system.

Conventions
In order to support heterogeneous networks, a standard data representation is needed to initiate communication and transfer typed data.

Protocol Types
The following data types are defined and used in protocol packets:
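The type list itself is elided in this copy. As an illustration only, fixed-width protocol types of the kind the specification uses (IMPI_Int4 appears later in the server command structure) can be defined portably in C; the names beyond IMPI_Int4 are assumptions of this sketch:

```c
#include <stdint.h>

/* Fixed widths matter here because protocol packets cross machines
 * with different native word sizes and compilers. */
typedef int32_t  IMPI_Int4;   /* 4-byte signed integer   */
typedef uint32_t IMPI_Uint4;  /* 4-byte unsigned integer */
typedef int64_t  IMPI_Int8;   /* 8-byte signed integer   */
```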

Data Format
The user data transferred is packed at the source and unpacked at the destination using the external data representation "external32" standardized in MPI-2 (section 9.5.2).

Host Identifiers
Host identifiers used in messaging are 16 bytes long. Typically the host IP address is used as the host identifier. A 16-byte container is defined to accommodate the IPv6 protocol.
Advice to implementors. The following text highlights IPv6/IPv4 addressing issues. It is taken from RFC 2373 by R. Hinden and S. Deering: The IPv6 transition mechanisms include a technique for hosts and routers to dynamically tunnel IPv6 packets over IPv4 routing infrastructure. IPv6 nodes that utilize this technique are assigned special IPv6 unicast addresses that carry an IPv4 address in the low-order 32 bits. This type of address is termed an "IPv4-compatible IPv6 address" and has the format of 96 zero bits followed by the 32-bit IPv4 address. (End of advice to implementors.)
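Packing an IPv4 address into the 16-byte host identifier along the lines of the IPv4-compatible layout described above can be sketched in C; the function name is this paper's assumption, not part of the protocol:

```c
#include <stdint.h>
#include <string.h>

/* Build a 16-byte IMPI host identifier from an IPv4 address:
 * 96 high-order zero bits followed by the IPv4 address in the
 * low-order 32 bits. */
void make_host_id(uint8_t id[16], const uint8_t ipv4[4])
{
    memset(id, 0, 12);          /* high-order 96 bits are zero        */
    memcpy(id + 12, ipv4, 4);   /* low-order 32 bits: IPv4 address    */
}
```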

Organization of this Document
This document is organized as follows: Chapter 2, Startup/Shutdown, describes the protocol used to initiate communication.
Chapter 3, Data Transfer Protocol, describes the protocol used to transfer data between two MPI implementations.
Chapter 4, Collectives, specifies the algorithms to be used in collective operations which span multiple MPI implementations.
Chapter 5, IMPI Conformance Testing, outlines the preliminary design for a Web-based IMPI conformance testing system.

Startup/Shutdown

Introduction
One of the major hurdles to overcome in making different MPI implementations interoperate is launching MPI applications in a multiple-vendor environment. Because we can't encompass all working environments, we must make some basic assumptions about those environments for which interoperability might most reasonably be expected.

ASSUMPTIONS:
1. TCP/IP is available and in use on at least one computer within each implementation universe.
Rationale. TCP/IP need not necessarily be available on all computers which are to run MPI processes; we merely require that such machines be able to communicate with such a machine running under the local MPI implementation. (End of rationale.)
2. The use of rsh must not be assumed. However, all else being equal, those solutions which lend themselves nicely to rsh environments are preferable to those which do not.
3. The use of UNIX must not be assumed. However, all else being equal, those solutions which lend themselves nicely to UNIX environments are preferable to those which do not.

CONCLUSION:
host:port is the best convention to use for establishing initial connections between implementations.

User Steps

Launching A Server
To launch a single job spanning multiple MPI implementations (with a common MPI_COMM_WORLD), a two-step process will be needed in general. The first step is to launch a 'server' process to be used as the rendezvous point for the different implementations. The name of the command used to start IMPI jobs (both the server and client) is implementation dependent; the name impirun is used throughout this document to represent this command. Regardless of the actual name, the command must be of the form:

impirun -server <count> [-port <port_number>]

Here, <count> is the number of client connections that the server expects to see. When impirun is started with the '-server' option, it creates a TCP/IP socket for listening and then prints both the IP address of the local host (in standard dot notation) and the port number of the socket (to stdout, if on a UNIX machine). If the '-port' option is specified, the server will attempt to start on the given <port_number>. If the '-port' option is not given, the server is free to choose any port number.
Rationale. Printing the complete address instead of only the port number allows for an easy cut-and-paste of the output. And using the IP address instead of the hostname eliminates potential name-lookup problems. (End of rationale.)

Launching Clients
impirun -client <rank> <host:port> <cmd_line>

<rank> specifies where the processes belonging to this client should be placed in MPI_COMM_WORLD relative to the other clients and must be a unique number between 0 and <count> - 1, inclusive.
<host:port> is the host:port string provided by the server.

Examples
Consider a machine named foo, which will be the server, and two machines named bar and baz, which will be the clients. The user wishes to run 8 copies of a.out on bar (with ranks 0-7 in MPI_COMM_WORLD) and 4 copies of b.out on baz (with ranks 8-11 in MPI_COMM_WORLD).

Advice to implementors. We do not mandate support for the '-np' syntax; this is simply common practice which we are using for the purpose of example. In general, anything following the host:port argument is completely implementation dependent (and may be quite complex).
(End of advice to implementors.)

Rationale. The above design allows users with rsh support to write a single shell script to launch a job. For example, on most UNIX systems, the above could be rewritten as follows:
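The example command lines are elided in this copy. Under the impirun convention described above, they might look like the following sketch; the server address 129.6.0.1:4455 and the '-np' option are illustrative assumptions:

```shell
# On foo: start the rendezvous server, expecting two clients.
impirun -server 2
# The server prints its contact string, e.g. 129.6.0.1:4455

# On bar: ranks 0-7 of MPI_COMM_WORLD run a.out.
impirun -client 0 129.6.0.1:4455 -np 8 a.out

# On baz: ranks 8-11 of MPI_COMM_WORLD run b.out.
impirun -client 1 129.6.0.1:4455 -np 4 b.out
```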

Security
Allowing any client on the Internet to establish a connection to the server process may make some users nervous. In light of the fact that security and authentication technology is ever-changing, IMPI is designed to have a modular and upgradable authentication scheme. This scheme is described in Section 2.3.3.
Rationale. Security has become a real concern for all users of the Internet. With the widespread popularity of network scanning tools, an open TCP/IP port on the server node is liable to be discovered by a malicious user, and potentially exploited (especially if the same port is used repeatedly). Many other meta-computing systems offer some form of authentication, ranging from a simple key to more complex protocols to protect against such occurrences.

Introduction
The IMPI server was designed to be as stupid as possible in order to provide maximum flexibility for future modifications to the clients. Basically it just collects opaque data from each of the clients, concatenates it all together, and broadcasts it back out again. Each client component of the full job can be broken down into two parts: procs and hosts. Procs are equivalent to MPI processes. Hosts are agents which control a set of procs. Every proc has exactly one host. A host might have only a single proc or it might have many; this is implementation dependent. For example, when running in a clustered SMP environment, there might reasonably be one host for each machine.
Hosts and procs need not exist on the same physical machine.

The IMPI Server
The server accepts commands of the following form:

typedef struct {
    IMPI_Int4 cmd;   // command code
    IMPI_Int4 len;   // length in bytes of command payload
} IMPI_Cmd;

The cmd field tells the server which command is being sent, and the len field tells the server how many bytes of payload are about to follow. In this way, servers can be made forward-compatible by simply discarding any command code that they do not understand.
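The forward-compatibility rule can be sketched as follows. The struct is restated to keep the sketch self-contained, and the command codes are hypothetical; the specification enumerates the real values:

```c
#include <stdint.h>

typedef int32_t IMPI_Int4;

typedef struct {
    IMPI_Int4 cmd;   /* command code                        */
    IMPI_Int4 len;   /* length in bytes of command payload  */
} IMPI_Cmd;

/* Hypothetical command codes for illustration only. */
enum { CMD_AUTH = 1, CMD_DONE = 2 };

/* After reading a header, the server either handles the payload or,
 * for an unknown command code, skips exactly hdr->len bytes in the
 * input stream.  Returns the number of payload bytes to discard
 * (0 when the command is recognized and will be parsed). */
int32_t payload_to_discard(const IMPI_Cmd *hdr)
{
    switch (hdr->cmd) {
    case CMD_AUTH:
    case CMD_DONE:
        return 0;          /* known command: payload is parsed */
    default:
        return hdr->len;   /* unknown command: consume and drop */
    }
}
```

Because every command carries its own length, an old server never needs to understand a new command in order to stay in step with the byte stream.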
Note that the traffic in both directions (i.e., client→server and server→client) is always tokenized.
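The discard-unknown-commands rule can be sketched in C. The Stream type, read_exact helper, and next_known_cmd function below are hypothetical stand-ins for the server's socket handling (a real server also converts header fields from network byte order); they illustrate only the forward-compatibility mechanism: read a header, and if the command code is unknown, skip exactly len payload bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int32_t IMPI_Int4;

typedef struct {
    IMPI_Int4 cmd;  /* command code */
    IMPI_Int4 len;  /* length in bytes of command payload */
} IMPI_Cmd;

/* Hypothetical in-memory stream standing in for the server's socket. */
typedef struct {
    const unsigned char *buf;
    size_t pos, size;
} Stream;

static int read_exact(Stream *s, void *dst, size_t n) {
    if (s->pos + n > s->size) return -1;
    memcpy(dst, s->buf + s->pos, n);
    s->pos += n;
    return 0;
}

/* Read command headers; discard the payload of any command code the
 * server does not understand (assumed here to be any code greater
 * than known_max).  This is what makes the server forward-compatible. */
static int next_known_cmd(Stream *s, IMPI_Cmd *out, IMPI_Int4 known_max) {
    IMPI_Cmd c;
    while (read_exact(s, &c, sizeof c) == 0) {
        if (c.cmd <= known_max) { *out = c; return 0; }
        s->pos += (size_t)c.len;  /* skip the opaque payload */
    }
    return -1;
}
```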

Commands
The following cmd values are defined:

The AUTH command
A client must be authenticated to the server before sending any other commands; the AUTH command must be the first command sent to the server. After successful authentication, the client may continue the IMPI startup process. If the authentication is not successful, the server will terminate the client's connection.
Advice to implementors. The protocols that are outlined below, because they are designed to be flexible, may seem somewhat amorphous and non-intuitive when read. There are comprehensive examples at the end of the description of each authentication protocol. (End of advice to implementors.)

The client and server will each have multiple authentication mechanisms available. As such, they must negotiate and agree on a specific method before authentication can proceed. Two authentication mechanisms are currently mandated for all client and server implementations: the IMPI_AUTH_NONE and IMPI_AUTH_KEY protocols.

Rationale. In order to make the authentication mechanism universal between client and server, the list of standardized methods is enumerated below. It is expected that this list can be expanded in the future, both by user requests for specific forms of authentication, and also with advances in authentication technologies. (End of rationale.)

Two factors determine whether an authentication method can be used in either the client or the server. First, the mechanism needs to be implemented in the software. Second, the mechanism needs to be enabled by the user. If both of these criteria are met, the authentication method is available for negotiation. If either the client or the server has no authentication mechanisms available for negotiation upon invocation, it will abort with an error message.
Since the client and server may have different authentication mechanisms available for negotiation, they must negotiate to decide on a common method to use. The client begins the negotiation by sending a bit mask of the authentication mechanisms available for negotiation to the server.
typedef struct {
    IMPI_Uint4 auth_mask;  // Mask of which authentication
                           // methods the client has available
} IMPI_Client_auth;

Advice to implementors. The auth_mask is only 32 bits long. While this is probably enough to specify currently available authentication mechanisms, it is possible that it will become desirable to have more than 32 choices in the future. This can be implemented by having the client send multiple IMPI_Client_auth structures and changing the value of len in the AUTH IMPI_Cmd header. (End of advice to implementors.)

For each client, the server will compare the client's available methods with its own, and choose the most preferable method that is supported by both. If no common method exists, the server will terminate the connection and display an error message. If a common mechanism exists, the server will inform the client which authentication method it wishes to use, optionally followed by any protocol-specific messages.
typedef struct {
    IMPI_Int4 which;  // Which authentication will be used
    IMPI_Int4 len;    // Length of follow-on [protocol-specific]
                      // message(s)
} IMPI_Server_auth;

The which variable is the enumerated value of the authentication mechanism to be used, with the least significant bit of the first auth_mask being 0, and the most significant bit being 31. The len variable indicates the length of any protocol-specific follow-on message(s) that may be sent by the server immediately after the IMPI_Server_auth message. A len of zero indicates that the server will not send any protocol-specific messages. All authentication messages after the IMPI_Server_auth message are protocol-dependent, and are detailed in the sections below.
The currently supported mechanisms are listed in Table 2.1. Their values are shown with their symbolic (i.e., #define) name, enumerated value (i.e., their corresponding which value), and their bit mask form (i.e., their corresponding auth_mask value). The symbolic name is synonymous with the enumerated value.
  Symbolic name     Enumerated value    Bit mask form
  IMPI_AUTH_NONE    0                   0x0001
  IMPI_AUTH_KEY     1                   0x0002

  Table 2.1: List of standardized authentication methods, shown in enumerated and bit mask forms.

On the command line of the server, the user can specify the order of preference of authentication methods. For example, IMPI_AUTH_NONE (if available for negotiation) should always be last in the order of preference. The following command line syntax will be used to specify the preference list:

impirun -server N -auth <preference_list>
The <preference_list> is a comma-separated list of which value ranges, specifying the highest preference on the left. A range can be a single number or a hyphen-separated range of numbers. For example, to specify protocol three as the most preferable, followed by IMPI_AUTH_KEY and IMPI_AUTH_NONE (in that order), the following syntax can be used:

impirun -server N -auth 3,1-0

If no -auth flag is specified on the command line, the server may choose any authentication mechanism that is available for negotiation on both the client and server.
Advice to implementors. High quality server implementations will choose the "strongest" or "best" form of authentication when multiple authentication mechanisms are available, even if easier, less-secure methods are also available. (End of advice to implementors.)
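The server's choice among commonly available methods can be sketched as a small routine. The preference list mirrors the -auth flag (most preferred first); the function name choose_auth and its signature are illustrative, not part of the specification.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t IMPI_Uint4;

/* Walk the server's preference list (most preferred first) and return
 * the first method whose bit is set in both the client's and the
 * server's availability masks.  Returns the "which" value of the
 * chosen method, or -1 if no common method exists (in which case the
 * server terminates the connection). */
static int choose_auth(IMPI_Uint4 client_mask, IMPI_Uint4 server_mask,
                       const int *pref, int npref) {
    for (int i = 0; i < npref; i++) {
        IMPI_Uint4 bit = (IMPI_Uint4)1u << pref[i];
        if ((client_mask & bit) && (server_mask & bit))
            return pref[i];
    }
    return -1;
}
```

For example, with both sides offering IMPI_AUTH_NONE and IMPI_AUTH_KEY (mask 0x0003) and a preference list of "1,0", the routine selects method 1 (IMPI_AUTH_KEY).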

IMPI_AUTH_NONE Protocol
Since some sites can guarantee the security of their networks (behind firewalls, etc.), no authentication is necessary. The IMPI_AUTH_NONE method is designed just for this purpose. The presence of the IMPI_AUTH_NONE environment variable allows the client (and server) to make this method available for negotiation.
If the IMPI_AUTH_NONE protocol is chosen, the which value sent to the client will be zero, and the len will also be zero. After the server sends the IMPI_Server_auth message, the authentication is considered successful; no further authentication messages are sent.
Advice to implementors. Even though the IMPI_AUTH_NONE protocol must be deliberately chosen by the user by setting the IMPI_AUTH_NONE environment variable, it is still a "dangerous" operation. A high quality implementation of the server should warn the user that a client has connected with IMPI_AUTH_NONE authentication by printing a message to the standard output (or standard error) that includes the network address of the connected client. (End of advice to implementors.)

Example Authentication Using IMPI_AUTH_NONE
The following command lines show two clients attempting to start. client1 sets the IMPI_AUTH_NONE environment variable and invokes impirun on myprog. client2 does not set the IMPI_AUTH_NONE environment variable, and aborts since the user presumably did not make any other authentication mechanisms available.

The server also sets the IMPI_AUTH_NONE variable, and invokes impirun. After printing out its IP and socket numbers, it receives a connection from client1, and prints a warning message stating that IMPI_AUTH_NONE was used to authenticate.

server% setenv IMPI_AUTH_NONE
server% impirun -server 2
12.34.56.78:9000
Warning: client1.foo.com (12.34.56.78) has authenticated with IMPI_AUTH_NONE.
The messages exchanged by client1 and the server were as follows:

Client1 sends: IMPI_Cmd { IMPI_CMD_AUTH, 4 }
Client1 sends: IMPI_Client_auth { 0x0001 }
Server sends:  IMPI_Server_auth { 0, 0 }

At this point, the authentication is considered successful for client1.

IMPI_AUTH_KEY Protocol
The IMPI_AUTH_KEY protocol is a simplistic mechanism that involves the client sending a key to the server. If the client's key matches the server's key, the authentication is successful. If this method of authentication is desired, the value of the key is placed in the IMPI_AUTH_KEY environment variable. The presence of a value in this variable allows both the client and the server to make the IMPI_AUTH_KEY protocol available for negotiation.
If this protocol is chosen, the server sends a which value of one, and a len of zero back to the client. The client responds with the following message.

typedef struct {
    IMPI_Uint8 key;  // 64-bit authentication key
} IMPI_Auth_key;

If the client's key does not match the key on the server, the server terminates the connection. If the client's key does match, the fact that the server does not terminate the connection indicates a successful authentication.

Example Authentication Using IMPI_AUTH_KEY
The server sets the key value in the environment variable IMPI_AUTH_KEY:

server% setenv IMPI_AUTH_KEY 5678
server% impirun -server 2
12.34.56.78:9000

The following command lines show two clients attempting to start. client1 sets the IMPI_AUTH_KEY environment variable to the same value as the server, and invokes impirun on myprog. client2 sets the IMPI_AUTH_KEY environment variable to the wrong value, and aborts when the server terminates the connection.

The IMPI command
The IMPI command informs the server that this client wishes to join an IMPI job. For the client-to-server packet, the payload consists of the rank of the client. After every client in the job has connected to the server and sent its own rank, the server will send back to each client the size, that is, the total number of clients in the job.

Example: Consider an IMPI job built from three clients:

client 0: 3 hosts, each with 2 processes per host
client 1: 2 hosts, each with 3 processes per host
client 2: 2 hosts, each with 4 processes per host

The exchange of messages for the IMPI command is shown in Figure 2.1. Each client will first send a single IMPI_Cmd containing the fields {IMPI_CMD_IMPI, 4}. The clients will then each send a single IMPI_Impi with the following fields:

client 0: { rank = 0 }
client 1: { rank = 1 }
client 2: { rank = 2 }

Figure 2.1: Passing rank to the server.

The COLL command
After the server replies to the IMPI commands from the clients, it is ready to start collecting other, opaque, startup information from them. This is done via the COLL command, which instructs the server to collect one payload from each of the clients and return the concatenation (in ascending client order) of all of them.
All COLL payloads sent from the clients to the server begin with a label field. The label field marks the payload as being of a certain kind; only buffers which share the same label will be concatenated by the server.
All COLL payloads sent from the server to the clients begin with the following struct:

typedef struct {
    IMPI_Int4 label;
    IMPI_Int4 client_mask;
} IMPI_Chdr;

In addition to the label field, this struct contains the client_mask field, which is a bitmask that identifies which clients have submitted values for this label. Bit i in the mask corresponds to the client with rank i (client rank as specified in the IMPI command, not an MPI_COMM_WORLD rank).
The following IMPI labels are currently defined (full explanations are provided below). C++ style comments are used for convenience.
To simplify server implementations, clients are required to pass labels in ascending numeric order. To simplify client implementations, servers are required to broadcast a concatenated buffer as soon as they receive complete sets of buffers from the clients.
Rationale. Ordering these labels allows the server to identify clients that have not implemented a particular label simply by observing the value of the current label sent from that client. Similarly, clients can ignore buffers they receive from the server with labels they do not understand.
The exact order of these labels is not particularly significant except that the IMPI_C_NHOSTS label must precede any per-host labels so that clients can correctly interpret the concatenated buffers they receive for the per-host labels. Similarly, the IMPI_H_NPROCS label must precede any per-proc label. (End of rationale.)

Reserved labels. The following labels are reserved for future use.

IMPI_NO_LABEL
This label exists only for future use; it is not used in IMPI version 0.0. It is not sent to the IMPI server.

Client labels.
The following labels represent client information.

IMPI_C_VERSION
The client sends an IMPI_Version for each version of the IMPI protocols that it supports. All IMPI clients must support version 0.0. The length of the array of IMPI_Version structures can be calculated from the payload len. Each client's array of version numbers must be in strictly ascending order.
Each client chooses the highest major:minor version number that all clients support. Both the major and minor version numbers must match on all clients. This version number determines the nature and content of all future communication between the hosts of each client pair, and may also determine which labels the client will send to the server.
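The selection rule can be sketched as follows, assuming an IMPI_Version struct with major and minor fields (as suggested by the example data later in this section); the helper names are illustrative. Since every client receives the same concatenated version lists from the server, each can run this computation independently and arrive at the same answer.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t IMPI_Uint4;

typedef struct {
    IMPI_Uint4 major;
    IMPI_Uint4 minor;   /* field names assumed from the example data */
} IMPI_Version;

static int version_le(IMPI_Version a, IMPI_Version b) {
    return a.major < b.major || (a.major == b.major && a.minor <= b.minor);
}

static int client_supports(const IMPI_Version *list, int n, IMPI_Version v) {
    for (int i = 0; i < n; i++)
        if (list[i].major == v.major && list[i].minor == v.minor) return 1;
    return 0;
}

/* Pick the highest major:minor version present in every client's list.
 * Candidates are drawn from client 0's list (every common version must
 * appear there).  Returns 0 on success, -1 if no common version exists. */
static int choose_version(const IMPI_Version *lists[], const int lens[],
                          int nclients, IMPI_Version *out) {
    int found = 0;
    IMPI_Version best = {0, 0};
    for (int i = 0; i < lens[0]; i++) {
        IMPI_Version v = lists[0][i];
        int ok = 1;
        for (int c = 1; c < nclients && ok; c++)
            ok = client_supports(lists[c], lens[c], v);
        if (ok && (!found || version_le(best, v))) { best = v; found = 1; }
    }
    if (!found) return -1;
    *out = best;
    return 0;
}
```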

Rationale.
Exchanging an IMPI version number between the clients allows for newer protocols to be developed while still maintaining compatibility with older codes. For example, an IMPI version could mandate the minimal set of IMPI COLL labels to be recognized.
It is expected that after IMPI version 0.0 begins to be used by real applications, changes in the protocols will be suggested and adopted by the IMPI steering committee. This will change the IMPI version number. The major version number indicates large differences between protocols, while the minor version number indicates smaller changes (such as corrections) in the published protocols.
The IMPI version number should not be confused with a particular vendor's software version number. The IMPI version number indicates a published set of protocols, not a particular implementation of those protocols.
Forcing all clients to implement version 0.0 maximizes flexibility and potential for interoperability. (End of rationale.)

For example, if the server broadcasts the following data for the IMPI_C_VERSION label:

IMPI_Uint4 major = 0
IMPI_Uint4 minor = 1
IMPI_Uint4 major = 0
IMPI_Uint4 minor = 2

all clients will use IMPI protocol version 0.1, since that is the highest version supported by all clients.
IMPI_C_NHOSTS
Each client contributes an IMPI_Uint4 that indicates the total number of hosts that it has. It must be ≥ 1.

IMPI_C_NPROCS
Each client contributes an IMPI_Uint4 that indicates the total number of procs on all of its hosts. It must be ≥ 1.

IMPI_C_DATALEN
Each client contributes an IMPI_Uint4 that indicates the maximum length, in bytes, of user data in a packet used for host-to-host communication. The smallest value specified by any client determines the value of IMPI_Pk_maxdatalen (see Section 3.5, Message Packets). It must be ≥ 1.

IMPI_C_TAGUB
Each client contributes an IMPI_Int4 that indicates the maximum tag value that will be used for host-to-host communication. Section 3.9 mandates some restrictions on this value.

IMPI_C_COLL_XSIZE
Each client contributes an IMPI_Int4 that indicates the minimum number of data bytes for which relevant collective calls will use "long" protocols. Relevant collective calls with data sizes less than this value will use "short" protocols.
Clients must provide a way for users to choose this value. If the user does not select a value, the client will contribute -1, indicating that that client wants the default value for this label. If the user does select a value (which must be ≥ 0), that value is sent. All clients must contribute the same value, or an error occurs.
The default value for IMPI_C_COLL_XSIZE for IMPI version 0.0 is 1024.

IMPI_C_COLL_MAXLINEAR
Each client contributes an IMPI_Int4 that indicates the minimum number of hosts for which relevant collective calls will use logarithmic protocols. Relevant collective calls with fewer hosts than this value will use linear protocols.
Clients must provide a way for users to choose this value. If the user does not select a value, the client will contribute -1, indicating that that client wants the default value for this label. If the user does select a value (which must be ≥ 0), that value is sent. All clients must contribute the same value, or an error occurs.
The default value for IMPI_C_COLL_MAXLINEAR for IMPI version 0.0 is 4.
Host labels. The following labels represent host information. For each of these labels, each client contributes an array of values, one entry per host.

IMPI_H_IPV6
Each client's array contains the IPv6 addresses of its hosts. Each IPv6 address is a 128-bit quantity (16 bytes).

IMPI_H_PORT
Each client's array contains the TCP port numbers for its hosts. Each value is an IMPI_Uint4.
IMPI_H_NPROCS
Each client's array contains the number of procs on each of its hosts. Each value is an IMPI_Uint4.
IMPI_H_ACKMARK
Each client's array contains the IMPI_Pk_ackmark values for its hosts. Each value is an IMPI_Uint4. Section 3.9 mandates some restrictions on this value.

IMPI_H_HIWATER
Each client's array contains the IMPI_Pk_hiwater values for its hosts. Each value is an IMPI_Uint4. Section 3.9 mandates some restrictions on this value.

Proc labels.
The following labels represent proc information.

IMPI_P_IPV6
Each client's array contains the IPv6 addresses of the hosts on which each proc resides. Each IPv6 address is a 128-bit quantity (16 bytes).
Advice to implementors. This IPv6 address is only used for unique identification of a proc; it need not be the same as the IPv6 address for the host that the proc resides on. (End of advice to implementors.)

IMPI_P_PID
Each client's array contains identification numbers for its procs. Each value is an IMPI_Int8, and must be unique among other procs that share the same IPv6 address.

Example: per-client labels
Consider the same three-client job from above. After receiving the concatenated IMPI buffer from the server, the clients now exchange their startup parameters.
The first parameter is the number of local hosts maintained by each client. In the above example, each client would first send an IMPI_Cmd to the server containing {IMPI_CMD_COLL, 8}.
After the IMPI_Cmd, the clients each send the IMPI_C_NHOSTS label followed by their local host count (for a total of 8 bytes, thus the value of 8 in the len field of the command). This exchange is shown in Figure 2.2. The server, upon receiving all of the IMPI_C_NHOSTS data, passes back the concatenated values to the clients.

All of the per-client values follow this same pattern. For another example, assume that client 0 has a maximum data length of 8000, while clients 1 and 2 have a maximum data length of 4000.
They therefore first send an IMPI_Cmd to the server containing {IMPI_CMD_COLL, 8}. After the IMPI_Cmd, the clients each send the IMPI_C_DATALEN label followed by their local maximum data length value.
The server, upon receiving all of the IMPI_C_DATALEN data, passes back the concatenated values to the clients. Each client can then independently determine the global minimum data length. Similarly, the IMPI_C_TAGUB value will be used by the clients to determine the minimum tag upper bound among the clients. The server, being a passive collecting and broadcasting process, knows nothing of the meaning of any of these COLL labels.
Note: Some labels may be optional, and therefore may not be sent by all clients. When this happens, the server will zero the appropriate bit(s) in the client_mask field and just concatenate the values from the participating clients. For example, if client 1 had not passed in a data length, the concatenated buffer would contain only the values from clients 0 and 2, with bit 1 of client_mask cleared. This is sent to all clients, regardless of whether they submitted a value for this label. Clients are free to ignore non-mandated COLL labels for the IMPI protocol version(s) that they are using, as well as commands they do not understand.
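The server's side of this behavior can be sketched in C. The coll_concat function and its signature are illustrative (a real server works on sockets and opaque byte streams); the sketch shows only the concatenation in ascending client order and the client_mask bookkeeping for skipped optional labels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int32_t IMPI_Int4;

typedef struct {
    IMPI_Int4 label;
    IMPI_Int4 client_mask;  /* bit i set if client i contributed */
} IMPI_Chdr;

/* Concatenate the payloads submitted by each client, in ascending
 * client order, and set the corresponding client_mask bit for every
 * contributor.  A NULL payload models a client that skipped this
 * (optional) label.  Returns the number of payload bytes written. */
static size_t coll_concat(IMPI_Int4 label,
                          const void *payloads[], const size_t lens[],
                          int nclients,
                          IMPI_Chdr *hdr, unsigned char *out) {
    size_t off = 0;
    hdr->label = label;
    hdr->client_mask = 0;
    for (int i = 0; i < nclients; i++) {
        if (payloads[i] == NULL) continue;      /* optional label skipped */
        hdr->client_mask |= (IMPI_Int4)1 << i;
        memcpy(out + off, payloads[i], lens[i]);
        off += lens[i];
    }
    return off;
}
```

With three clients where client 1 skips the label, the resulting client_mask is 0x5 (bits 0 and 2 set) and the broadcast buffer holds only the two submitted payloads.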

Example: per-host labels
Again using the same three-client example, let's say that it is now time for the clients to submit the port numbers for their hosts. This exchange is shown in Figure 2

The FINI command
Like the DONE command, the FINI command contains no payload, so it should always have a len of zero. A client must issue the FINI command after all of its procs successfully exit. The server waits to receive the FINI command from all clients; it does not generate any return traffic to the clients in response. After receiving the FINI command from a client, the server may close the socket to that client. A server-client socket that dies before the client sends the FINI command is an indication of an error which should be reported to the user; server behavior after this error is undefined. The server, after receiving a FINI from all clients, exits successfully.

Shall We Dance?
The exchange of startup parameters is completed when the DONE command is received from the server. At this point, it may be necessary for additional socket connections to be established. The agents responsible for each socket must therefore participate in a connect/accept "dance", the order of which is defined as follows:

/*
 * Higher ranked host connects, lower ranked host accepts.
 */

After a successful connect(), each host must send its rank as a 32-bit value to the accepting process.
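A minimal sketch of the role selection and the rank exchange follows. The Role type and function names are illustrative; serializing the 32-bit rank in network byte order is an assumption here, chosen to match the byte order the protocol mandates for packet headers.

```c
#include <assert.h>
#include <stdint.h>

/* Decide each side's role in the connect/accept "dance": the higher
 * ranked host calls connect(), the lower ranked host calls accept(). */
typedef enum { ROLE_CONNECT, ROLE_ACCEPT } Role;

static Role dance_role(int32_t my_rank, int32_t peer_rank) {
    return my_rank > peer_rank ? ROLE_CONNECT : ROLE_ACCEPT;
}

/* After a successful connect(), the connecting host sends its rank as
 * a 32-bit value so the accepting host can identify the peer.  Network
 * byte order (big-endian) is assumed and written out by hand. */
static void rank_to_wire(int32_t rank, unsigned char out[4]) {
    out[0] = (unsigned char)((uint32_t)rank >> 24);
    out[1] = (unsigned char)((uint32_t)rank >> 16);
    out[2] = (unsigned char)((uint32_t)rank >> 8);
    out[3] = (unsigned char)((uint32_t)rank);
}
```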

Shutdown Wire Protocols
An IMPI job shuts down in the following order: procs, hosts, clients, and finally, the IMPI server. As per Section 4.24, MPI_FINALIZE invokes an MPI_BARRIER on MPI_COMM_WORLD. After this barrier, each proc performs implementation-specific cleanup and shutdown.
After all the procs of a host have completed (meaning that there will be no further MPI communications from each proc), the host will send an IMPI_PK_FINI packet to each other host (see Section 3.8), indicating that it is shutting down. The host must wait for the corresponding IMPI_PK_FINI from each other host before closing a communications channel. The host must consume arriving packets and issue appropriate acknowledgments until the IMPI_PK_FINI arrives. The host then performs implementation-specific cleanup and shutdown.
After all the hosts of a client have completed (meaning that all IMPI_PK_FINI messages have been sent, and all of the client's host communication channels have been closed), the client sends an IMPI_CMD_FINI message to the IMPI server. The client then performs implementation-specific cleanup and shutdown.
After the IMPI server receives an IMPI_CMD_FINI message from each client, it performs its own cleanup and shutdown.
Rationale. The sequence of events during IMPI shutdown is mandated to avoid race conditions and deadlock. (End of rationale.)

Client and Host Attributes
This section is heavily influenced by Section 5.3 of the MPI-2 Journal of Development, "Cluster Attributes." Inter-client and inter-host communications may be significantly slower than communications between processes on the same host. It is therefore desirable for programs to be able to determine which ranks are local and which are remote.
The following attributes are predefined on all communicators: IMPI_CLIENT_SIZE returns as the keyvalue the number of IMPI clients included in the communicator.
IMPI_CLIENT_COLOR returns as the keyvalue a number between 0 and IMPI_CLIENT_SIZE-1. The value returned indicates with which client the querying rank is associated. The relative ordering of colors corresponds to the ordering of host ranks in MPI_COMM_WORLD.
IMPI_HOST_SIZE returns as the keyvalue the number of IMPI hosts included in the communicator.
IMPI_HOST_COLOR returns as the keyvalue a number between 0 and IMPI_HOST_SIZE-1. The value returned indicates with which host the querying rank is associated. The relative ordering of colors corresponds to the ordering of host ranks in MPI_COMM_WORLD.
Advice to users. This interface returns no information about the magnitude of the difference between communication within and between clients/hosts. However, this can be determined by running small application-specific benchmarks as part of the application. The returned color can be used as input to MPI_COMM_SPLIT. (End of advice to users.)

Introduction
This chapter specifies the protocol used to transfer data between two MPI implementations. The protocol assumes a reliable, ordered, bidirectional stream communication channel between the two implementations. The channel is assumed to have a finite but unspecified amount of buffering.
The protocol does not rely on the channel buffering for its operation. Processes on one side of the channel belong to the same MPI implementation. Two implementations communicate via a dedicated channel. Message routing (the selection of a channel to use for a particular message transfer) is not addressed here. It is assumed that an agent at the source determines the appropriate channel to use and directs the message to it. In essence, the data transfer protocol enables multiple processes to have timeshared access to a single communication channel, and provides mechanisms to throttle fast senders and cancel transferred messages.

The protocol is defined independently of the underlying channel technology. Initially, TCP/IP is expected to be used by most implementations. Some implementations may opt for a restricted interoperability space and choose a different channel technology, while others may support multiple technologies.

The protocol does not specify the interaction between processes and their agent, nor the medium used (e.g., sockets, shared memory). To provide generality of implementation, no restrictions are placed on the process/agent setup (e.g., shared access to a socket, file descriptor passing). To support the MPI-2 client/server functionality, no parent/child relationship is assumed between processes and their agent.

Process Identifier
Messages exchanged between implementations are multiplexed in the channel. A system-wide unique process identifier is required to label the message source and destination. To support the MPI-2 client/server functionality, a decentralized mapping of processes to identifiers is chosen. The IMPI_Proc process identifier is defined as the combination of a system-wide unique host identifier and a process identifier unique within the host. Solutions with a restricted interoperability scope may select other host identification methods. IMPI does not mandate p_pid to be unique across all implementations within a given host. Thus IMPI does not guarantee interoperability between two implementations that share a host within a single MPI application.
Advice to implementors. Implementors are encouraged to use an OS-wide unique p_pid identifier within a host, such as a UNIX pid. This would support IMPI host sharing in practice, and can be helpful for situations such as testing IMPI functionality. (End of advice to implementors.)

Context Identifier
A context identifier of type IMPI_Uint8 is associated with every MPI communicator (intra- and inter-communicators). It has the following properties:

- It uniquely identifies a communicator within a process.
- All processes within a communicator group use the same context identifier for that communicator.
- The context identifier of MPI_COMM_WORLD is 0.
Advice to implementors. Mandating a collectively unique context ID may be a burden on some implementations that use memory addresses to segregate message contexts. Such implementations may choose to let the agent handle the mapping between context IDs and memory addresses and not impact the performance of the intra-implementation communication protocols. (End of advice to implementors.)

Message Tag
The message tag is of type IMPI_Int4. MPI requires MPI_TAG_UB to be at least 32767. At startup time, the actual tag upper bound, IMPI_Tag_ub, is negotiated between the implementations.

Message Packets
MPI requires that messages of active requests be uniquely identified to allow for their cancellation. Requests that have been completed or are otherwise inactive cannot be canceled. As a result, an IMPI message is identified by its source and destination processes, and by a source request identifier unique for every active request at the source process. The request identifier is of type IMPI_Uint8. The total message length is represented by a value of type IMPI_Uint8. The sender's local rank in the communicator is given in the pk_lsrank field of the header. This allows the receiving process to set the MPI_SOURCE status entry without having to map pk_src to its local rank.
Advice to implementors. On systems with 64-bit memory addressing or less, the address of the request object at the source process may be used as the unique identifier of an active request. On systems with wider memory addressing, the source process would need to maintain a mapping of active requests to identifiers.

The source and destination processes are identified by their IMPI_Proc structures instead of their local ranks in the communicator used. This gives implementors more freedom in the design of the internal agent protocol with respect to message buffering and matching messages to receive requests. For example, messages may be buffered and matched:

- in the agent, the agent acting as an MPI-aware "database";
- in the receiving processes, the agent acting purely as a funnel;
- a mixture of both, where the agent handles the buffering and matching on a destination process basis, and the receiving processes handle the buffering and matching of MPI tags and context IDs.

(End of advice to implementors.)
A message is divided into packets, each containing up to IMPI_Pk_maxdatalen bytes of user data. The IMPI_Pk_maxdatalen value is negotiated at startup time. All message packets are sent in the same channel, in sequential order. A fixed-length packet header, IMPI_Packet, holds the message information and identifies the type of transfer: data packet (synchronous message or not), synchronization or protocol acknowledgment (ACK), cancel request, cancel reply (successful or not), or finalization. The maximum size of a packet is IMPI_Pk_maxdatalen plus the size of the IMPI_Packet header.
In addition to identifying messages for cancellation, the source request identifier is used by the sender to access the request handle in the rendezvous message protocol. Similarly, an optional destination request identifier, of type IMPI_Uint8, may be used to accelerate the receiving process's access to the request handle. Its usage and the required support by the peer agent are discussed in a later section.
Three optional quality-of-service fields are made available. They may be used by collaborating implementations to provide additional services, such as profiling or debugging. If used, pk_count holds the message "count" argument (number of datatype elements), pk_dtype is an opaque handle that uniquely identifies the sender's datatype within the process (e.g., the handle of the datatype object), and pk_seqnum holds a sequence number that helps identify a message independently of its source request ID, which may be reused (e.g., a sequence number unique per sending process, per sending agent, per sender/receiver process pair, or per agent pair).

Advice to implementors. The choice of 4- or 8-byte integers in IMPI_Packet is a trade-off between providing enough storage space where needed with some room for future extensions, keeping the structure size a power-of-two value (128 bytes in this case), and ordering the elements to avoid compiler padding. (End of advice to implementors.)

The data packet is made of a header followed by up to IMPI_Pk_maxdatalen bytes of packed user data, and uses all the header fields. The four other packet types are header-only, and use a subset of the header fields. The list of fields used by each packet type is given in Table 3.1. Network byte order is used in the header.
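The packet-division arithmetic implied by the text can be sketched directly. The function names are illustrative, and treating a zero-length message as still occupying one data packet is an assumption (an empty MPI message must still be delivered somehow).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t IMPI_Uint8;
typedef uint32_t IMPI_Uint4;

/* Number of data packets needed for a message of msglen bytes when
 * each packet carries at most maxdatalen bytes of user data.
 * (Assumption: a zero-length message still occupies one packet.) */
static IMPI_Uint8 packets_needed(IMPI_Uint8 msglen, IMPI_Uint4 maxdatalen) {
    if (msglen == 0) return 1;
    return (msglen + maxdatalen - 1) / maxdatalen;
}

/* Bytes of user data carried by packet i (0-based) of the message:
 * every packet is full except possibly the last one. */
static IMPI_Uint4 packet_datalen(IMPI_Uint8 msglen, IMPI_Uint4 maxdatalen,
                                 IMPI_Uint8 i) {
    IMPI_Uint8 sent = i * (IMPI_Uint8)maxdatalen;
    IMPI_Uint8 left = msglen - sent;
    return left < maxdatalen ? (IMPI_Uint4)left : maxdatalen;
}
```

For example, a 10000-byte message with a negotiated IMPI_Pk_maxdatalen of 4000 travels as three data packets carrying 4000, 4000, and 2000 bytes.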

Packet Protocol
At the packet level, a simple throttling protocol is set up to limit the amount of buffering required and to prevent fast senders from affecting the message flow of other processes sharing the channel. This creates process-pair virtual channels. The number of virtual channels mapped onto a single channel is not fixed and can change according to the application's behavior. The communication agents are expected to handle the resulting change in buffering requirements. At startup time, two packet protocol values of type IMPI_Uint4 are negotiated:

IMPI_Pk_ackmark: The number of packets received by the destination process before a protocol ACK is sent back to the source.
IMPI Pk hiwater: The maximum number of unreceived packets the source can send before requiring a protocol ACK to be send back.
For each process pair, the source maintains a packets-sent counter and the destination maintains a packets-received counter. The destination process sends a protocol ACK to the source process for every IMPI_Pk_ackmark packets it receives from that source. This decrements the source's counter by IMPI_Pk_ackmark. When the source's counter reaches the IMPI_Pk_hiwater value, it refrains from sending more packets to that destination until an ACK is received from it. The transfer of protocol ACK packets does not modify the value of the counters. The implementation is expected to provide sufficient buffering to receive the protocol ACK packets and to expedite their processing.

Table 3.1. Fields used by each packet type.

    Packet Type      Fields Used
    data             all fields (QoS fields optional)
    data sync.       all fields (QoS fields optional)
    sync. ACK        pk_type, pk_src, pk_dest, pk_srqid, pk_drqid
    protocol ACK     pk_type, pk_src, pk_dest
    cancel request   pk_type, pk_src, pk_dest, pk_srqid
    cancel reply     pk_type, pk_src, pk_dest, pk_srqid
    finalization     pk_type
Advice to implementors. Depending on the implementation's internal process/agent protocol, packet counters can either be maintained by the processes or by the agent. (End of advice to implementors.)
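The per-pair throttling described above can be sketched in a few lines of C. This is a minimal simulation, not part of the specification: the structure and function names are hypothetical, and it only models the counter arithmetic (the sender blocks at IMPI_Pk_hiwater outstanding packets; each protocol ACK credits IMPI_Pk_ackmark packets).

```c
#include <assert.h>

/* Per process-pair flow-control state (names hypothetical). */
typedef struct {
    unsigned ackmark;   /* negotiated IMPI_Pk_ackmark */
    unsigned hiwater;   /* negotiated IMPI_Pk_hiwater */
    unsigned sent;      /* sender side: packets sent and not yet ACKed */
    unsigned received;  /* receiver side: packets since last protocol ACK */
} pair_flow;

/* Sender side: may another packet be sent now? */
int can_send(const pair_flow *f) { return f->sent < f->hiwater; }
void on_send(pair_flow *f) { f->sent++; }

/* Receiver side: returns 1 when a protocol ACK must be sent back. */
int on_receive(pair_flow *f) {
    if (++f->received == f->ackmark) { f->received = 0; return 1; }
    return 0;
}

/* Sender side: each protocol ACK credits ackmark packets. */
void on_ack(pair_flow *f) { f->sent -= f->ackmark; }
```

For example, with ackmark = 2 and hiwater = 4, a sender can emit four packets, must then wait, and resumes after the receiver's two protocol ACKs arrive.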

Message Protocols
The data, synchronization, cancel request, and cancel reply packet types are used to construct the protocols for handling MPI point-to-point transfers. A message of up to IMPI_Pk_maxdatalen bytes is categorized as a short message; it fits in a single data packet. Longer messages are split into several data packets. The IMPI_PK_DATASYNC packet type notifies the receiving process that the sender is expecting a synchronization ACK for the message. Otherwise, the IMPI_PK_DATA packet type is used.

Short-Message Protocol
Short messages are sent eagerly, relying on the packet protocol ACKs for flow control. The pk_len and pk_msglen fields have the same value. If the IMPI_PK_DATASYNC packet type is used, the destination process sends a synchronization ACK packet back to the source after it matches the message to a receive request. The pk_srqid field in the ACK packet must be set to the value of the pk_srqid field in the message packet. The sender must store the send request identifier in the outgoing packet and receives it back in the ACK packet. This mechanism is used by the sending process to locate the request that matches the ACK packet.

Short messages generated by MPI_SSEND and MPI_ISSEND are mapped onto the short-message protocol with the IMPI_PK_DATASYNC packet type. All other short messages are mapped onto this protocol with the IMPI_PK_DATA packet type.

Long-Message Protocol
For long messages, the first data packet is sent eagerly with the IMPI_PK_DATASYNC packet type. When the destination process matches the packet to a receive request, it sends a synchronization ACK packet back to the source process. The source can then send all remaining data packets with the IMPI_PK_DATA packet type.

The pk_srqid field in the ACK packet must be set to the value of the pk_srqid field in the message packet. The sender must store the send request identifier in the outgoing packet and receives it back in the ACK packet. This mechanism is used by the sending process to locate the request that matches the ACK packet.

Likewise, the pk_drqid field in the data packets sent after the ACK packet is received (that is, all data packets except the first one) must be set to the value of the pk_drqid field in the ACK packet. The receiving process may use this field to store a handle to the matching MPI receive request in the ACK packet, and receive it back in all the following data packets. This avoids having the receiving process search for the matching request for each remaining data packet.

Long messages generated by all MPI send calls are mapped onto the long-message protocol, independent of their blocking nature and synchronization requirements.
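The short/long split above reduces to simple arithmetic on the message length. The helper below is an illustrative sketch (the function name is hypothetical; the treatment of zero-length messages as a single header-only packet is an assumption):

```c
#include <assert.h>
#include <stdint.h>

/* Number of data packets a message of msglen bytes occupies, given the
 * negotiated IMPI_Pk_maxdatalen. A message that fits in one packet is
 * "short"; anything longer follows the long-message protocol.
 * Assumption: a zero-length message still occupies one (header-only)
 * data packet. */
uint64_t npackets(uint64_t msglen, uint64_t maxdatalen) {
    if (msglen == 0)
        return 1;
    return (msglen + maxdatalen - 1) / maxdatalen;  /* ceiling division */
}
```

With maxdatalen = 32767, a 32767-byte message is short (one packet) while a 32768-byte message requires two packets and the long-message handshake.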

Message-Probing Protocol
Supporting the MPI_PROBE and MPI_IPROBE functions does not require special packet transfers. The protocol is purely local between the process and the agent. IMPI defines the conditions under which a message is considered to be available for the purpose of probing.

Packet transfers are considered atomic operations, independent of the medium's transfer mechanism. A message is considered available to the destination process after its first packet (its only packet, for short messages) has been completely read by the agent, including the packet's user data segment. The total length of the message is available in the packet's pk_msglen field, so the status information needed by the probe calls is available.
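The availability rule above can be sketched as a local check against the agent's buffered first packet. The structure and function names here are hypothetical, for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Agent-side view of a buffered message's first packet (sketch). */
typedef struct {
    int      complete;   /* first packet fully read, data segment included */
    uint64_t pk_msglen;  /* total message length from the header */
} buffered_msg;

/* Returns 1 and sets *nbytes if the message is available for probing;
 * returns 0 while the first packet is still being read. */
int probe_available(const buffered_msg *m, uint64_t *nbytes) {
    if (!m->complete)
        return 0;
    *nbytes = m->pk_msglen;
    return 1;
}
```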

Message-Cancellation Protocol
For a send request, there is a time window during which a call to MPI_CANCEL can cause a cancel request packet to be sent to the message destination. This can happen in the following cases:

After a short non-synchronous message is sent.
After a short synchronous message is sent and before the synchronization ACK is received.
After the first packet of a long message is sent and before the synchronization ACK is received.

In all other cases, the MPI_CANCEL call must be resolved locally. Once a cancel request is sent, a cancel reply packet must be returned, independently of whether a synchronization ACK for that message was already sent back. This allows MPI_CANCEL to act as a simple RPC call, waiting for the reply, and simplifies the operation of the agents. If the message's first data packet has not been received by the destination process (i.e., matched to a receive request), the agent sends an IMPI_PK_CANCELYES reply packet and atomically destroys the buffered packet. Otherwise, an IMPI_PK_CANCELNO packet is sent back. Note that due to the message ordering guarantee, a cancel request cannot be received without the agent having fully read the message's first packet. Thus the message can be in either of two states: buffered, or received by the destination process. The transition from the buffered state to the received state happens when the message's first packet matches a receive request, irrespective of the state of the remaining packets in the case of a long message.

For a given pair of processes, the sender's request identifier is used by the receiver to select the message to be canceled. The request identifier is unique among the sender's active requests, but multiple messages buffered at the receiver may share the same sender request identifier. In such a case, only the last message received can be canceled; the other messages are no longer attached to an active request. This requires that the storage of unexpected messages be searchable in reverse chronological order.
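One simple way to satisfy the reverse-chronological-order requirement is to prepend each arriving unexpected message to a list, so a front-to-back scan visits the newest message first. This is a sketch under that assumption (names hypothetical); real implementations may use any structure with the same search order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Unexpected-message store entry (sketch). */
typedef struct unexp {
    uint64_t srqid;      /* sender's request identifier */
    struct unexp *next;
} unexp;

/* Prepend on arrival: the newest message becomes the head. */
unexp *store_prepend(unexp *head, uint64_t srqid) {
    unexp *m = malloc(sizeof *m);
    m->srqid = srqid;
    m->next = head;
    return m;
}

/* Newest-first search: a cancel request therefore matches only the
 * last-received message carrying the given sender request id. */
unexp *find_cancel_target(unexp *head, uint64_t srqid) {
    for (; head != NULL; head = head->next)
        if (head->srqid == srqid)
            return head;
    return NULL;
}
```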

Finalization Protocol
When an agent determines that its processes no longer require a channel for communication, it sends a finalization packet (IMPI_PK_FINI packet type) to notify the agent on the opposite side of the channel of its request to terminate the connection. An agent may not close the connection until it has sent an IMPI_PK_FINI packet to the other agent and received one from it. Until a finalization packet is received, an agent must continue to consume arriving packets and issue the appropriate acknowledgments; this effectively destroys unmatched messages. Acknowledgments are the only packets an agent may send on a channel after it issues a finalization packet. Because the finalization packet is exchanged between agents, it does not require buffering and thus does not affect the protocol ACK counters. The finalization protocol allows agents to distinguish between applications that terminate successfully and those that terminate abnormally (see section 2.4). It does not mandate error handling for the latter.

Advice to implementors. The protocol does not specify how an agent determines when its processes no longer need the channel. This is a local, implementation-specific synchronization between MPI_FINALIZE and the agent.

It is recommended that the agent not set the TCP/IP socket's SO_LINGER option to a linger time of zero. If it is set to zero, the connection may be destroyed before the IMPI_PK_FINI packet reaches its destination, causing the receiving agent to erroneously conclude that the application terminated abnormally.

The protocol does not specify the error handling an agent performs in cases of unmatched messages. It only requires that unmatched messages be destroyed. (End of advice to implementors.)

Mandated Properties
To support wide interoperability, IMPI requires the data transfer channel to be an Internet-domain socket. In addition, the following must be true:

    1 <= IMPI_Pk_ackmark <= IMPI_Pk_hiwater
    1 <= IMPI_Pk_maxdatalen
    32767 <= IMPI_Tag_ub <= 2147483647
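An implementation can validate a negotiated parameter set against the mandated bounds with a simple predicate. This is an illustrative sketch (the function name is hypothetical; the bounds are those mandated above):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 iff the negotiated values satisfy the mandated properties. */
int impi_params_valid(uint32_t ackmark, uint32_t hiwater,
                      uint32_t maxdatalen, uint64_t tag_ub) {
    return 1 <= ackmark && ackmark <= hiwater
        && 1 <= maxdatalen
        && 32767 <= tag_ub && tag_ub <= 2147483647u;
}
```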

Introduction
This chapter specifies the algorithms to be used in collective operations on communicators which span multiple MPI clients.
A native communicator is defined to be a communicator in which all processes are running under the same MPI client.
IMPI places no restrictions on and does not specify the implementation of collective operations on native communicators.
Processes running under the same MPI client are defined to be local processes. Similarly, communication between processes running under the same MPI client is called local communication.
Communication between processes running under different MPI clients is referred to as nonlocal or global.
Many of the collective operations consist of one or more local and global phases of communication. An IMPI implementation is free to implement local phases in whatever manner it chooses but must implement the global phases as specified in order to properly interoperate with other implementations. No global communications may be done other than those explicitly specified.
In the specifications of the collectives, great liberties have been taken with the cleaning up of temporary objects (e.g., intermediate groups). It is expected that implementors will add the necessary resource freeing. Additionally, little specific error handling is specified. It is expected that implementors will check the return codes of the MPI functions used and return appropriately on error.
Some of the collective operations require that data packed by one implementation be unpacked by another implementation. This requires that all implementations use the same format for such packed data. This leads to the following restriction on MPI_Pack() in the case of non-native communicators. The format of data packed by a call to MPI_Pack() with a non-native communicator is the wire (external32) format with no header or trailer bytes. In addition, a call to MPI_Pack_size() with a non-native communicator will return as the size the minimum number of bytes required to represent the data in the wire (external32) format.

No restriction is placed on the behavior of MPI_Pack() and MPI_Pack_size() when called with native communicators.
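The external32 representation fixes the wire size of each basic type (e.g., 4 bytes for an int, 8 for a double, per the MPI standard), so the non-native MPI_Pack_size() result is just a product. The sketch below illustrates this for a few basic C types; the enum and function names are hypothetical helpers, not part of IMPI or MPI:

```c
#include <assert.h>

/* A few basic types and their fixed external32 wire sizes in bytes
 * (sizes as defined by the MPI standard's external32 representation). */
enum wire_type { W_CHAR, W_INT, W_FLOAT, W_DOUBLE };

int external32_size(enum wire_type t) {
    switch (t) {
    case W_CHAR:   return 1;
    case W_INT:    return 4;
    case W_FLOAT:  return 4;
    case W_DOUBLE: return 8;
    }
    return -1;
}

/* On a non-native communicator, the packed size is exactly count times
 * the wire size: no header or trailer bytes are permitted. */
int pack_size_external32(int count, enum wire_type t) {
    return count * external32_size(t);
}
```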

Utility functions
For a given communicator there is one master process per MPI client the communicator spans. The master process for a client is the process of lowest rank running under that client. Note that the process of rank 0 within a communicator is always a master process. The master processes in a communicator are numbered from 0 to (number of masters) - 1 in order of rank in the communicator.

For example, consider a communicator of size 8 which spans 3 clients (say A, B, and C), with ranks 0, 1, 4 under client A, ranks 2, 3, 5 under client B, and ranks 6, 7 under client C. Then the master processes are ranks 0, 2, and 6, and they are numbered 0, 1, and 2 respectively.
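The master ranks can be derived mechanically from a per-rank client assignment: rank r is a master iff no lower rank shares its client, and masters are numbered in order of rank. The sketch below (function name hypothetical) reproduces the example above:

```c
#include <assert.h>

/* Fills masters[] with the ranks of the master processes, in rank order
 * (which is also master-number order), and returns how many there are.
 * client[r] identifies the MPI client under which rank r runs. */
int find_masters(const int *client, int size, int *masters) {
    int nmasters = 0;
    for (int r = 0; r < size; r++) {
        int seen = 0;
        for (int q = 0; q < r; q++)
            if (client[q] == client[r])
                seen = 1;               /* a lower rank shares this client */
        if (!seen)
            masters[nmasters++] = r;    /* r is the client's lowest rank */
    }
    return nmasters;
}
```

For the 8-process example (clients A, A, B, B, A, B, C, C), this yields masters at ranks 0, 2, and 6, numbered 0, 1, and 2.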
The descriptions of the IMPI collectives make use of the following utility functions. Each implementation is free to implement them in whatever manner it sees fit.

int is_master(int r, MPI_Comm comm)
Returns TRUE iff process rank r in comm is a master process.

int are_local(int r1, int r2, MPI_Comm comm)
Returns TRUE iff processes ranked r1 and r2 in comm are local to one another.

int master_num(int r, MPI_Comm comm)
If process rank r in comm is a master process, returns its master number; else returns -1.

int master_rank(int n, MPI_Comm comm)
Returns the rank in comm of master number n.

int local_master_num(int r, MPI_Comm comm)
Returns the master number of the master process local to process rank r in comm.

int local_master_rank(int r, MPI_Comm comm)
Returns the rank in comm of the master process local to process rank r in comm.

int num_masters(MPI_Comm comm)
Returns the number of master processes in comm.

int num_local_to_master(int n, MPI_Comm comm)
Returns the number of processes in comm local to master process number n.

int num_local_to_rank(int r, MPI_Comm comm)
Returns the number of processes in comm local to process rank r in comm.

int *locals_to_master(int n, MPI_Comm comm)
Returns an array containing the ranks in comm of the processes local to master number n.

int cubedim(int n)
If n > 0, returns the dimension of the smallest hypercube containing at least n vertices (i.e., the smallest i such that n <= 2^i); else returns -1.
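One possible implementation of the cubedim utility (implementations are free to realize it however they like, as long as the semantics match):

```c
#include <assert.h>

/* Dimension of the smallest hypercube containing at least n vertices:
 * the smallest i such that n <= 2^i; returns -1 for n <= 0. */
int cubedim(int n) {
    int i = 0;
    if (n <= 0)
        return -1;
    while ((1 << i) < n)
        i++;
    return i;
}
```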

Context Identifiers
Context IDs are of type IMPI_Uint8 and are collectively unique. Collectively unique means the context IDs for a communicator are the same for each process in the communicator and no other communicator of which the process is a member has the same context IDs.
Advice to implementors. Mandating collectively unique context IDs may be a burden on some implementations that use memory addresses to segregate message contexts. Such implementations may choose to let the agent handle the mapping between context IDs and memory addresses and not impact the performance of the intra-implementation communication protocols. (End of advice to implementors.) Each communicator has two context IDs. One is used for point-to-point communication and the other for collective communication. The collective context ID is always one greater than the point-to-point context ID.
The point-to-point context identifier of MPI_COMM_WORLD is 0 and the collective context identifier is 1.
Many of the collective operations are defined in terms of point-to-point communications on a communicator. All point-to-point communications which occur inside collectives must use the communicator's collective context ID.
Advice to implementors. This can be done, for example, by passing a shadow "collective version" of the communicator to the point-to-point communication. (End of advice to implementors.)

Context ID Creation
When a new communicator is created it must be assigned collectively unique context IDs. Generating the new context ID is a collective operation over the communicator(s) from which the new one is being derived.
The basic mechanism for creating a new context ID is to first find the maximum context ID currently in use by any process involved in the new context creation. The new point-to-point context ID is then this maximum plus one and the new collective context ID is this maximum plus two.
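The context ID arithmetic can be sketched as follows. In a real implementation the maximum is obtained by a reduction over the participating processes; here a plain array scan stands in for that collective step, and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* The two context IDs of a communicator: the collective ID is always
 * one greater than the point-to-point ID. */
typedef struct {
    uint64_t ptp_cid;
    uint64_t coll_cid;
} context_ids;

/* max_cid[i] is process i's current maximum context ID in use
 * (IMPI_max_cid). The new point-to-point ID is the global maximum plus
 * one, and the new collective ID is the global maximum plus two. */
context_ids new_context(const uint64_t *max_cid, int nprocs) {
    uint64_t max = 0;
    for (int i = 0; i < nprocs; i++)
        if (max_cid[i] > max)
            max = max_cid[i];
    context_ids c = { max + 1, max + 2 };
    return c;
}
```

For instance, three processes whose maxima are 1, 5, and 3 agree on point-to-point ID 6 and collective ID 7.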
In the descriptions of the collective algorithms which create new contexts, it is assumed that each process keeps track of the maximum context ID it has in use in the variable

    IMPI_Uint8 IMPI_max_cid;

which is initialized to 1 in MPI_Init().
Advice to implementors. Implementations are free to allocate other context IDs (e.g., for shadow communicators) but they must ensure that the value of IMPI_max_cid is correctly maintained.
In systems with limited context ID space, the agent for each process can maintain a mapping between the limited space and the 64-bit IMPI space. (End of advice to implementors.)

(Annotated code excerpts from the collective algorithm specifications, covering MPI_Intercomm_create, MPI_Bcast, MPI_Gather, and MPI_Reduce, appeared here; only the MPI_Bcast excerpt is recoverable in full. It illustrates the general local/global phase structure:)

    MPI_Comm_rank(comm, &myrank);
    /* global phase */
    if (myrank == root ||
        (is_master(myrank, comm) && !are_local(myrank, root, comm)))
        master_bcast(buf, count, dtype, root, comm);
    /* local phase */
    if (are_local(myrank, root, comm))
        broadcast the data from the root to the local processes;
    else
        broadcast the data from the local master to the local processes;

Future work
The collectives are defined so that for each client there is one process, the master, which participates in the global communication phases. In cases where there are multiple hosts per client it is reasonable to expect that better performance may be obtained by having multiple processes, one per host, participating in the global communication phases. The current collectives can easily be extended to this model by changing the definition of master process so that there is one per host rather than one per client. In addition, to allow maximal exploitation of native communication, it may be necessary to modify some of the collectives so that local phases remain defined between all processes in a client rather than between all processes local to a host.

Summary
This chapter describes a Web-based IMPI conformance testing system. This testing system is intended to assist implementors of IMPI and to verify the compliance of their implementations with the protocol defined in this document. Should a dispute occur over what the IMPI protocol specifies, the IMPI document, not this tester, should be considered the final word. This is a work in progress; comments and suggestions are welcome.

This tester is designed to verify the correct implementation of the IMPI-specific protocols, not to test an MPI implementation against the MPI specifications. The full testing of MPI is well beyond the scope of this tester. The IMPI tester operates only within the C implementation of MPI. The testing of IMPI within a Fortran or C++ environment is not yet planned.
To help describe this testing environment, some conformance testing terminology will be used. The implementation of the IMPI protocol to be tested will be referred to as the Implementation Under Test (IUT). The IUT will be running on a System Under Test (SUT). The SUT refers to all the hardware and software needed to run the IUT. The person running the test will be called the tester.
The steps taken by a tester to run the IMPI conformance tests are outlined below. Figure 5.1 shows this testing scheme graphically. The numbers in this figure indicate the order in which the communications channels are established initially. The dashed lines are HTTP Web communication (connection-less communications, stateless). The solid lines are connection-oriented TCP/IP sockets. The dotted line is direct tester interaction with the SUT.
The tester connects to the NIST IMPI web page at http://impi.nist.gov/IMPI (Figure 5.2) with a Java-enabled browser. The current version of the IMPI standard (this document) is available from this web page.
Follow the IMPI Test Tool link to the main IMPI Test Tool page. This page gives a short description of the major parts of the IMPI testing software: the Test Interpreter, the Test Tool Applet, the Test Scripts, and the Test Server.
Follow the Test Interpreter link to obtain the current version of the test interpreter. The user must compile and link this interpreter for the IUT on the SUT before continuing with the tests.
Follow the Test Scripts link to view the source of the test scripts and descriptions of the available tests. Selecting the Run the IMPI Test Tool Applet link will initiate the IMPI Test Tool (a Java client applet), which will be downloaded and run in the tester's browser (see Figure 5.4). This applet establishes a permanent TCP/IP connection (disconnected after a predetermined timeout period of inactivity) with an instance of the IMPI Test Manager (the test server) on the NIST IMPI Web server. Test requests and results will be exchanged over this socket.
The tester can choose the configuration of clients, hosts, and processes to be tested through the Test Tool. This includes choosing where the impirun -server process will be run, how many clients there will be, how many of these clients will be on the NIST side (simulated clients), how many hosts will be on each simulated client, how many processes will be on each of these simulated hosts, and what rank to assign to the first simulated client.
This does not allow some valid configurations to be specified, such as varying the number of hosts per simulated client. If this becomes necessary for the proper testing of the IMPI protocol, then this interface will be modified to allow more general configurations.
Once the configuration of clients and hosts is specified, the IMPI Test Tool allows the tester to run the IMPI Startup protocol only, for early testing of an implementation. The Run Startup Only button on the Test Tool initiates this test. This option runs through the startup on each client, with each process printing its rank to stdout for confirmation of the ordering within MPI_COMM_WORLD. Each process exits before starting the test interpreter. When running startup-only, the NIST side sends a trace of the progress of the startup protocol to the Test Tool so that errors in the startup processing can be more easily identified. This tracing does not occur otherwise.

The MPI routines exercised under this option are MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize. Unfortunately, MPI_Finalize is not a local operation in IMPI since it executes an MPI_Barrier, so for startup-only to complete cleanly, the SUT must have this collective operation already implemented. This option should still be useful early in the implementation, even if the processes hang in the MPI_Finalize call each time.
Errors that occur during the startup communications will be reported to the tester through the Test Tool and possibly also through stdout on the SUT.

The complete set of test scripts is made available if the Run Startup & Prepare for Scripts button is pressed instead of the Run Startup Only button.
Immediately after either the Run Startup Only or the Run Startup & Prepare for Scripts button has been pressed, a sample impirun command line is shown to the tester in the scrolling output window of the Test Tool. This command line includes the specific HOST:PORT needed as a command-line parameter to the IMPI clients. The tester must then run the IMPI test interpreter on the SUT using this command line, or the equivalent command line if the IUT uses a different syntax.
In the future, a command-line template may be made available so that the tester can specify the required syntax for the IUT, with place holders for varying parameters such as the number of processes. This will allow the Test Tool to display the command lines in the correct syntax for the IUT, which should help avoid errors in starting the IUT correctly for the various tests to be performed. Until then, the tester can write a script to parse the sample command line and call their own mpiexec or mpirun routine properly.

Figure 5.1: IMPI Test Architecture

Figure 5.3 shows the levels of communications protocols involved in this testing. We are most interested in verifying the operation of the IMPI level of this communications stack, so the tests originate at the MPI level (an upper-tester). We also have a lower-tester that is able to examine the IMPI protocol packets as they are received from the SUT. This lower-tester has been implemented by instrumenting the Model IMPI implementation directly. The Model IMPI implementation is part of the IMPI Test Manager.

Test Manager
Once the tester starts the IMPI test interpreter on the SUT, the IMPI startup protocol begins between the IUT and the Model IMPI implementation running on the NIST IMPI test server. If the startup protocol succeeds, the IMPI Test Manager will present to the tester a menu of options. The format used to display this menu will likely change as the number of available tests increases. If the startup protocol fails, error recovery will be attempted and as much information as possible about the startup negotiations will be provided.
Assuming the startup protocol succeeded, the tester may select a test or group of tests to be run. The Test Manager will then begin running the tests by sending each test, one at a time, to the processes running on the SUT. Each test will determine the result, either Pass/Fail/Indeterminate, and this result will be sent to the Test Tool for display (see Figure 5.4). Once all of the requested tests have been run, the tester may select more tests or discontinue testing. For each test, the tester may obtain, through the IMPI web page, the source for the script that is being interpreted. We might eventually also provide hyper-links into the IMPI or MPI documents for the text that specifies the part or parts of the protocol that this code is intended to test. To keep a permanent record of the results of the tests run, the Results window in the Test Tool (see Figure  5.4) can be mailed to the tester, or the contents of this window can be cut directly from the window and pasted elsewhere.
If you are running many tests, it helps performance to periodically clear the Results window. The buffering of unlimited amounts of output tends to slow down the Test Tool.
The full testing of the IUT must include varying the impirun command-line options. The command line can specify the relative ranks of the SUT nodes within the MPI_COMM_WORLD group. Also, tests must be made using a single node of the SUT as well as multiple nodes. The tester must therefore follow the instructions given in the Test Tool as to the proper impirun command line to execute each time the test interpreter is started. Although it would be preferable to automate this, we do not yet, and probably will never, have this capability. However, we will at least provide the tester with command lines in the Test Tool window to cut and paste.
Starting with Java SDK version 1.1, it will be possible to gain access to the disks on the machine running the Test Tool, as well as run external programs on that machine. This will be allowed if the tester certifies that the Test Manager is trusted. How this will be implemented in the browsers is not yet known. Many Web browsers in use are still using Java SDK 1.0, and those that have been updated to version 1.1 have not implemented this feature, so this is not yet an option. Even with this ability, it is not clear that we will be able to automate the running of impirun on the SUT.
Tests will also be provided to allow additional IMPI/MPI implementations to participate in the tests, in addition to the IUT. These tests will be needed to fully exercise the IMPI/MPI protocols. Figure 5.4 shows a sample of the current interface. The Test Tool is shown here after running two collective communications tests, one for MPI_Barrier() and one for MPI_Bcast(). The green (or gray, if you do not have a color copy of this document) highlighting of the test names indicates that the tests have passed. Tests that fail would be highlighted in red, and tests with indeterminate results would be highlighted in yellow. The most up-to-date instructions on using the tester can always be found on the IMPI web site at http://impi.nist.gov/IMPI/ImpiTIAIntro.html.

Test Interpreter
Tests are written as scripts that are interpreted within a standard C MPI program on the system under test. These test scripts are delivered to the MPI processes as arrays of characters via MPI_Send(). The first test to be run each time the interpreter is started is a test that is embedded in the interpreter. This initial test verifies that this delivery mechanism is working. If this initial test fails, no other tests are attempted and diagnostic messages are sent to the Test Tool. For the special case of Startup Only testing, this initial test of communications is not attempted.
Once the basic delivery of test scripts has been verified, the tester may select tests to be run. At this point, in the event of a test failure, it is most important to help testers diagnose the error. For this reason, the contents of each script will be available on request via the Test Tool. This section describes the operation of the interpreter and the language that it accepts so that the tester can read these scripts and hopefully diagnose the failure. Error messages produced by these scripts will also be more understandable given the test script that produced them.
The test interpreter is a C/MPI program that runs on the SUT. This program runs a loop that repeatedly receives and interprets small scripts that superficially look like standard C and MPI code. The interpreter repeats this loop until a script calls a special function that signals the interpreter to exit. The language that the interpreter understands is limited to a small subset of C with MPI function calls available. The MPI routines in the interpreter are wrappers to the actual MPI library routines in the IUT. Declarations of variables and singly dimensioned arrays of the basic data types are allowed, as are control structures such as if, while, and for. In addition to the MPI routines, other routines are available, such as printf() for printing to stdout and report() for passing information to the Test Manager. Details of the language accepted by this interpreter are available in the source code. A test consists of a block of statements in this language. Some sample tests are shown below.
The same interpreter, with some modifications, is executed within one or more simulated MPI processes which are running as part of the Test Manager. These simulated MPI processes execute the same script as the IUT. The MPI routines in this case are linked to routines, internal to the Test Manager, which implement the communications between the simulated processes as well as external MPI processes. Other modifications to the interpreter allow these simulated MPI processes to interact directly with the Test Manager, supplying test status and error information as the script is interpreted.
Upon completion of each test script, a short handshake protocol is executed. This synchronizes the processes between tests, although it is not a complete barrier. As part of this handshake, each process informs the master process of the level (an integer) of the most severe error encountered during the execution of the current test script. In this context, the master process is the process with the lowest MPI_COMM_WORLD rank on the NIST side, and has no relation to the IMPI-defined master processes. In this end-of-test handshake, the master process sends a "DONE" message to every rank. The other ranks receive and verify this message and send back "done". The master rank receives and verifies each "done" message. See the end_of_test_handshake() routine in sutInterp.c for more details. If a test script appears to have completed but the tester is hung, the processes may be stuck in this final handshaking protocol, indicating that one or more processes have not made it to the end of the current test script.
To distinguish between the completion of the current test script and the completion of the final handshake, observe the following printouts. At the end of a test script, and before the final handshake, the message

END EXECUTING rank r

is printed, where r is the process rank. A second message is printed after the final handshake completes.

[Sample test script 1: a point-to-point test; listing not reproduced]

For the preceding test to pass, MPI process rank 0 must receive a message from each other process with the tag and the single integer payload both matching the source process rank. This test will usually be run with rank 0 owned by the Test Manager, although this is not required. Tests may have requirements as to the number of processes and their ordering in MPI_COMM_WORLD. These restrictions must be enforced by the script itself when the tests are executed; the Test Manager has no specific information about these scripts that would allow it to accept or deny any particular test.

[Sample test script 2: an MPI_Reduce() test; listing not reproduced]

Each node in the preceding test fills an array and then calls the MPI_Reduce() routine. The correct answer is computed, based on the number of nodes involved in the test, and the root node verifies that the correct value has been received. The Test Manager's simulated MPI processes actively participate in the algorithms used to perform this, and other, collective operations.
The interpreter must be able to execute test scripts that exercise all aspects of the IMPI protocols. Many test scripts have been written, and more will be added as needed. The first tests implemented focused on the IMPI start-up protocol and the basic MPI routines MPI_Init(), MPI_Finalize(), MPI_Send(), and MPI_Recv(). As of this writing, the tester can also exercise MPI_Sendrecv(), the non-blocking point-to-point routines MPI_Isend() and MPI_Irecv(), MPI_Iprobe(), all of the collective communication routines, as well as the communicator handling routines MPI_Comm_create(), MPI_Comm_dup(), MPI_Comm_split(), MPI_Intercomm_create(), and MPI_Intercomm_merge(). There are over 100 test scripts currently available to exercise these routines. All scripts are available online at the NIST IMPI web site.

Test Manager
Note that the Test Manager itself knows very little about the test scripts and what they are designed to test. All of this information is implicit in the test scripts. These scripts are interpreted on all of the MPI processes including the Test Manager's simulated MPI processes.
On the NIST Web server, which runs the Test Manager, each test script is stored as a single ASCII file. During testing, these scripts are read from disk and sent to each of the MPI processes as input to the interpreter. The Test Manager maintains a queue of requested tests and, in order to meet the configuration requirements of the requested tests, may prompt the tester as needed to restart the interpreter on the SUT. The interpreter must be restarted each time a set of tests requires a different configuration of processes than the current one.
The Test Manager's version of the test script interpreter includes support subroutines such as report(), which accepts messages (text strings) from the simulated MPI processes and forwards them to the Test Tool applet for display, so that the tester can monitor the progress of the tests.