Performance optimization and modeling of fine-grained irregular communication in UPC

The UPC programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in the form of quantifiable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions for similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.


Motivation
We will focus on a specific category of matrix-vector multiplication operations where the involved matrix is sparse and has a constant number of nonzero values per row. Such a computational kernel appears in many branches of computational science. A performance challenge is that the nonzero values per matrix row can spread irregularly with respect to the columns. Consequently, any computer implementation will involve irregular and indirectly indexed accesses to numerical arrays. The resulting fine-grained irregular data accesses need to be handled with care, particularly in parallel implementations. We thus choose this computational kernel as a concrete case of fine-grained irregular communication that can occur in UPC code. We want to demonstrate that proper code transformations can be applied to a naive, programmer-friendly but inefficient UPC implementation, to obtain considerable improvements in computing speed. Moreover, the obtained performance enhancements can be backed up by conceptually simple performance models.
The remainder of this paper is organized as follows. Section 2 explains the basic mechanism of shared arrays, the cornerstone of UPC programming. Then Section 3 gives a numerical description of our target computational problem and a naive UPC implementation. Thereafter, Section 4 shows in detail three programming strategies that transform the naive implementation for increasingly better performance. In Section 5, three performance models are developed to match with the three code transformations. Section 6 presents an extensive set of numerical experiments and time measurements for both showing the impact of the code transformations and verifying the performance models developed. Relevant related work is reviewed in Section 7, whereas Section 8 shows how our performance modeling methodology can easily be extended to simpler 2D calculations on a uniform mesh, before Section 9 concludes the entire paper with additional comments.

Shared arrays of UPC
Shared data arrays, whose elements are freely accessible by all the threads, typically constitute the main data structure of a UPC program. They thus deserve a separate introduction in this section. The most common scenario is that the elements of a shared array have an evenly distributed thread affinity. This gives a straightforward approach to data partitioning while providing a user-controllable mechanism of data locality. The standard upc_all_alloc function (see e.g. [28]), to be used by all the UPC implementations in this paper for array allocation, has the following syntax:

    shared void *upc_all_alloc(size_t nblks, size_t nbytes);

Note that shared is a specific type qualifier. Also, upc_all_alloc needs to be collectively called by all threads to allocate a shared array. Upon return, a private pointer to the allocated shared array, or a private pointer-to-shared in the more rigorous UPC terminology, becomes available on each thread. The allocated shared array consists of nblks blocks in total, whose affinity is distributed evenly among the threads in a cyclic order. The value of nbytes is the number of bytes occupied per block, which translates into the block size, i.e., the number of elements per block, as nbytes/sizeof(one element). The blocks that have affinity to the same owner thread are physically allocated contiguously in the owner thread's local memory.
This data ownership distribution scheme, which in many cases determines an associated work partitioning, has the advantage of a simple mapping between a global element index and the owner thread ID. With B denoting the number of elements per block, the owner thread of element i can be described as follows:

    owner_thread(i) = ⌊i/B⌋ mod THREADS,   (1)

where THREADS is a built-in variable of UPC that stores the total number of threads participating in a parallel execution. The value of THREADS is fixed at either compile time or run-time.
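The block-cyclic ownership rule can be sketched in a few lines. The following Python snippet is an illustration only (the actual mapping is carried out by the UPC runtime in C); blocksize denotes the number of elements per block:

```python
def owner_thread(i, blocksize, threads):
    """Owner thread ID of element i in a block-cyclic shared array."""
    return (i // blocksize) % threads

# With 3 threads and 4 elements per block, block b is owned by thread b % 3.
assert owner_thread(0, 4, 3) == 0    # block 0 -> thread 0
assert owner_thread(7, 4, 3) == 1    # block 1 -> thread 1
assert owner_thread(13, 4, 3) == 0   # block 3 -> thread 0
```

Note how the owner depends only on the block index, never on the phase within a block.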
Accessing the elements of a shared array by their global indices, although programmer friendly, can potentially incur considerable overhead. This is because a pointer-to-shared has three fields: the owner thread ID, the phase (i.e. the element offset within the affinity block) and the corresponding local memory address, see [13]. The standard upc_threadof(shared void *ptr) function of UPC returns the owner thread ID of an element that is pointed to by the pointer-to-shared ptr. Every access through a pointer-to-shared requires updating the three fields and thus always incurs overhead. Moreover, if the accessing thread is different from the owner thread, a behind-the-scene data transfer between the two threads has to be carried out. For indirectly indexed accesses of the elements in a shared array, a compiler cannot batch the individual between-thread data exchanges for the purpose of message aggregation (such as described in [9,8]). The individual between-thread data exchanges are thus particularly costly when the accessing thread runs on a different compute node than the owner thread.

SpMV and a naive UPC implementation
This section is devoted to explaining the target computational kernel of this paper and presenting a naive UPC implementation.

Definition of sparse matrix-vector multiplication
Mathematically, a general matrix-vector multiplication is compactly denoted by y = Mx. Without loss of generality we assume that the matrix M is square, having n rows and n columns. The input vector x and the result vector y are both of length n. Then, the general formula for computing element number i of the result vector y is as follows (using zero-based indices):

    y(i) = Σ_{j=0}^{n-1} M(i,j) x(j).   (2)

If most of the M(i,j) values are zero, M is called a sparse matrix. In this case the above formula becomes unnecessarily expensive from a computational point of view. A more economic formula for computing y(i) in a sparse matrix-vector multiplication (SpMV) is thus

    y(i) = Σ_{j: M(i,j)≠0} M(i,j) x(j),   (3)

which only involves the nonzero values of matrix M on each row. Moreover, it is memory-wise unnecessarily expensive to store all the n² values of a sparse matrix, because only the nonzero values are used. This prompts the adoption of various compact storage formats for sparse matrices, such as the coordinate format (COO), compressed sparse row format (CSR), compressed sparse column format (CSC), and the EllPack format [15]. In particular, for sparse matrices that have a fixed number of nonzero values per row, it is customary to use the EllPack storage format, which conceptually uses two 2D tables. Both tables are of the same size, having n rows and the number of columns equaling the fixed number of nonzeros per row. The first table stores all the nonzero values of the sparse matrix, whereas the second table stores the corresponding column indices of the nonzeros. Moreover, if we assume that all the values on the main diagonal of a sparse matrix M are nonzero, which is true for most scientific applications, it is beneficial to split M as

    M = D + A,   (4)

where D is the main diagonal of M, and A contains the off-diagonal part of M. Then, a modified EllPack storage format can employ a 1D array of length n to store the entire main diagonal D.
There is no need to store the column indices of these nonzero diagonal values, because their column indices equal the row indices by definition. Suppose r_nz now denotes the fixed number of nonzero off-diagonal values per row. For storing the nonzero values in the off-diagonal part A, it is customary to use two 1D arrays both of length n · r_nz (instead of two n × r_nz 2D tables), one stores the nonzero off-diagonal values consecutively row by row, whereas the other stores the corresponding integer column indices. Following such a modified EllPack storage format, a straightforward sequential C implementation of SpMV is shown in Listing 1, where the integer array J contains the column indices of all nonzero off-diagonal values. The sparsity pattern of the M matrix, i.e., where its nonzeros are located, is described by the array J of column indices. The actual pattern is matrix dependent and irregular in general, meaning that each x(i) value is irregularly used multiple times in computing several values in the result vector y. This has an important bearing on the achievable performance of a typical computer implementation, because the actual sparsity pattern of the M matrix may affect the level of data reuse in the different caches of a computer's memory system. Additionally, for the case of parallel computing, some of the values in the x vector have to be shared between processes (or threads). The irregular data reuse in the x vector will thus imply an irregular data sharing pattern. The resulting communication overhead is determined by the number of process pairs that need to share some values of the x vector, as well as the amount of shared data between each pair. The impact of these issues on different UPC implementations of SpMV will be the main subject of study in this paper. In the following, we first present a naive UPC implementation, whereas code transformation strategies that aim to improve the performance will be discussed in Section 4.
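As a concrete illustration of this storage scheme, the row-by-row traversal just described can be transcribed as follows. This Python sketch mirrors the array layout of the modified EllPack format (the array names follow the text; it is an illustration, not the paper's Listing 1):

```python
def spmv_ellpack(n, r_nz, D, A, J, x):
    """y = (D + A) x, with the diagonal in D (length n), the r_nz
    off-diagonal nonzeros per row stored row by row in A (length
    n*r_nz), and their column indices in J (length n*r_nz)."""
    y = [0.0] * n
    for i in range(n):
        tmp = D[i] * x[i]                      # diagonal contribution
        for j in range(r_nz):
            tmp += A[i*r_nz + j] * x[J[i*r_nz + j]]   # indirect access via J
        y[i] = tmp
    return y

# 3x3 example with one off-diagonal nonzero per row:
assert spmv_ellpack(3, 1, [2, 3, 4], [1, 1, 1], [1, 2, 0], [1, 2, 3]) == [4, 9, 13]
```

The indirect access x[J[...]] in the inner loop is exactly the source of the fine-grained irregular communication once x becomes a shared array.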

A naive UPC implementation
The user-friendliness of UPC allows for an equally compact and almost identical implementation of the SpMV computational kernel (starting from the line of upc_forall in Listing 2) as the straightforward C implementation in Listing 1. An immediate advantage is that parallelization is automatically enabled through using the upc_forall construct of UPC, which deterministically divides the iterations of a for-loop among the threads. The five involved data arrays, which are all allocated by upc_all_alloc as shared arrays, are evenly distributed among the UPC threads in a block-cyclic order. More specifically, the arrays x, y and D adopt a programmer-prescribed integer value, BLOCKSIZE, as their block size associated with the affinity distribution. The arrays A and J, both of length n · r_nz, use r_nz*BLOCKSIZE as their block size. This gives a consistent thread-wise data distribution for the five shared data arrays. The UPC implementation of SpMV shown in Listing 2 is clean and easy to code. The parallelization details, i.e., data distribution and work partitioning, are an inherent part of the language definition of UPC. Since the number of nonzeros per matrix row is assumed to be fixed, the adopted thread-wise data distribution for the shared D, A, J and y arrays is perfect, in that each thread will only access its owned blocks of these arrays. For the shared array x, whose values are indirectly accessed via the column index array J, underlying data transfers between the threads are inevitable in general.
As will be detailed later, the irregular column positions of the nonzero values (stored in the array J) will cause fine-grained and irregular between-thread data exchanges associated with the shared array x in the straightforward UPC implementation shown in Listing 2. Tuning the value of BLOCKSIZE can change the pattern and volume of between-thread communication. However, to ensure good performance, proper code transformations of such a naive UPC implementation are necessary.

Strategies of performance enhancement
This section studies three programming strategies that can be applied to transforming the naive UPC implementation. The main purpose is to reduce the impact of implicit between-thread data exchanges that are caused by irregular and indirectly indexed accesses to the shared array x. At the same time, we also want to eliminate some of the other types of avoidable overhead associated with UPC programming.

Explicit thread privatization
Some of the programmer-friendly features of UPC are accompanied with performance penalties. Relating to the naive UPC implementation of SpMV in Listing 2, these concern automatically dividing the iterations of a for-loop among threads by upc_forall and allowing any thread to access any element of a shared array. See Section 2 about the latter.
The upc_forall construct of UPC is a collective operation. In the example of upc_forall (i=0; i<n; i++; &y[i]) used in Listing 2, all the threads go through the entire for-loop and check the affinity of each iteration, by comparing whether upc_threadof(&y[i]) equals the built-in MYTHREAD value that is unique per thread. Although only iterations having an affinity equaling MYTHREAD are executed by a thread, it is not difficult to see the overhead due to excessive looping and calls to the standard upc_threadof function behind the scene.
To get rid of the unnecessary overhead associated with upc_forall, we can let each thread work directly with the loop iterations that have the matching affinity. Note that the affinity distribution of the i-indexed loop iterations can be easily determined using the value of BLOCKSIZE. Such an explicit thread-privatization of the loop iterations also opens up another opportunity for performance enhancement. Namely, all the globally indexed accesses to the shared arrays y, A, J and D (except x) can be replaced by more efficient accesses through private pointers and local indices. This is achievable by using the well-known technique of casting pointers-to-shared to pointers-to-local, see e.g. [32]. Following these two steps that can be loosely characterized as explicit thread privatization, the naive UPC implementation can be transformed as shown in Listing 3. It can be observed in Listing 3 that each thread now only traverses its designated rows of the sparse matrix. The computational work per thread is executed by going through its owned blocks in the shared arrays y, D, A and J, for which each thread is guaranteed to never touch blocks owned by the other threads. Pointers to these four shared arrays are cast to their local counterparts loc_y, loc_D, loc_A and loc_J, since array accesses through private pointers and local indices are the most efficient. On the other hand, a similar casting of the pointer-to-shared x cannot be done, because the indirect accesses of form x[loc_J[k*r_nz+j]] may lead to situations where the accessing thread is different from the owner thread. Compared with the naive UPC implementation in Listing 2, the transformed version after explicit thread privatization will have a much better performance. However, further performance improvement can be obtained by also "privatizing" the global accesses to the shared array x. This can be achieved by two code transformations, of different programming complexities and performance gains, which are to be detailed below.
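The per-thread block traversal resulting from this transformation follows the index arithmetic mb*THREADS + MYTHREAD. A small Python sketch (illustrative only, mirroring the integer arithmetic of Listing 3) shows that the blocks visited by the threads form a disjoint cover of all blocks:

```python
def my_blocks(mythread, threads, nblks):
    """Global indices of the blocks owned by one thread under the
    cyclic block distribution used by upc_all_alloc."""
    my_nblks = nblks // threads + (1 if mythread < nblks % threads else 0)
    return [mb * threads + mythread for mb in range(my_nblks)]

# 7 blocks over 3 threads: every block is visited exactly once.
blocks = sorted(b for t in range(3) for b in my_blocks(t, 3, 7))
assert blocks == list(range(7))
```

Because each thread derives its block list from MYTHREAD alone, no affinity checks (and hence no upc_threadof calls) are needed at run time.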

Block-wise data transfer between threads
Although Listing 3 improves the naive UPC implementation by the approaches of explicit thread privatization, each thread still indirectly accesses the elements of the shared array x through global indices that are stored in the array J (now cast to pointer-to-local loc_J per block). When an indirectly indexed x[loc_J[k*r_nz+j]] value has affinity with MYTHREAD, the overhead only concerns updating the three fields of a pointer-to-shared. If MYTHREAD is different from the owner thread, however, a behind-the-scene data transfer will be executed in addition. Moreover, these between-thread data transfers will happen one by one, because a typical compiler is unable to batch the individual transfers. The extra overhead is particularly high if the owner and accessing threads reside on different compute nodes. To avoid the potentially high overhead associated with x[loc_J[k*r_nz+j]], we can create a private copy of x on each thread and transfer the needed blocks from x to the private copy before carrying out the SpMV. The resulting UPC implementation is shown in Listing 4.

    /* Allocation of the five shared arrays x, y, D, A, J as in the naive implementation */
    // ...
    /* Allocation of an additional private x array per thread */
    double *mythread_x_copy = (double*) malloc(n*sizeof(double));
    /* Prep-work: check for each block of x whether it has values needed by MYTHREAD;
       make a private boolean array 'block_is_needed' of length nblks */
    // ...
    /* Transport the needed blocks of x into mythread_x_copy */
    for (int b=0; b<nblks; b++)
      if (block_is_needed[b])
        upc_memget(&mythread_x_copy[b*BLOCKSIZE], &x[b*BLOCKSIZE],
                   min(BLOCKSIZE, n-b*BLOCKSIZE)*sizeof(double));
    /* SpMV: each thread only goes through its designated blocks */
    int mythread_nblks = nblks/THREADS + (MYTHREAD<(nblks%THREADS) ? 1 : 0);
    for (int mb=0; mb<mythread_nblks; mb++) {
      int offset = (mb*THREADS + MYTHREAD)*BLOCKSIZE;
      /* casting shared pointers to local pointers */
      double *loc_y = (double*) (y+offset);
      // ...
    }

Listing 4: An improved UPC implementation of SpMV by block-wise communication
In Listing 4 we have used the one-sided communication function upc_memget of UPC to transfer all the needed blocks, one by one, from the shared array x into a thread-private local copy named mythread_x_copy. The syntax of upc_memget is as follows:

    void upc_memget(void *dst, shared const void *src, size_t n);

We have thus completely avoided accessing values of the shared array x. However, there are a few "prices" paid on the way. First, each thread needs to allocate its own mythread_x_copy array of length n. This obviously increases the total memory usage. Second, the actual computation of SpMV on each thread needs to be preceded by transporting all the needed blocks from the shared array x into the corresponding places of mythread_x_copy. Specifically, a needed block from x is defined as having at least one x[loc_J[k*r_nz+j]] value that will participate in calculating the designated elements in y on each thread (with MYTHREAD as its unique ID). We note that each needed block is transported in its entirety, independent of the actual number of x values needed in that block. This also applies to the blocks of x that are owned by MYTHREAD. The whole procedure of transporting the needed blocks of x, implemented as the for-loop indexed by b in Listing 4, will result in time usage overhead. Nevertheless, this additional time usage is often compensated by avoiding the individual accesses to the shared array x. Third, to identify whether a block of x is needed by MYTHREAD requires pre-screening the designated blocks of the array J (not shown in Listing 4). This is typically considered a negligible "one-time" cost if the same sparse matrix, or the same sparsity pattern shared among several sparse matrices, is repeatedly used in many SpMV operations later.
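The pre-screening step amounts to one pass over the thread's portion of J. A Python sketch (a hypothetical helper illustrating how the block_is_needed array of Listing 4 could be computed, not the paper's actual code):

```python
def needed_blocks(my_rows, J, r_nz, blocksize, nblks):
    """Mark the blocks of x containing at least one value that is
    used by this thread's designated rows ('block_is_needed')."""
    block_is_needed = [False] * nblks
    for i in my_rows:
        for j in range(r_nz):
            # The column index tells us which block of x is touched.
            block_is_needed[J[i*r_nz + j] // blocksize] = True
    return block_is_needed

# Rows 0 and 1 reference columns 0, 5, 1, 9 -> blocks 0, 1, 2 (size 4).
assert needed_blocks([0, 1], [0, 5, 1, 9], 2, 4, 3) == [True, True, True]
```

Once computed, this boolean array is reused for every subsequent SpMV with the same sparsity pattern, which is why the cost is regarded as "one-time".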

Message condensing and consolidation
One shortcoming of the transformed UPC code shown in Listing 4 is that each needed block from x is transported in its entirety. This will lead to unreasonably large messages, when only a small number of values in a block of x are needed by MYTHREAD. Also, several messages may be transported (instead of one consolidated message) between a pair of threads, where each message has a rigid length of BLOCKSIZE. To condense and consolidate the messages, we can carry out a different code transformation as follows.

Preparation step
Each thread checks, in a "one-time" preparation step, which of its owned x values will be needed by the other threads. We also ensure that only one message is exchanged between each pair of communicating threads. The length of a message from thread T1 to thread T2 equals the number of unique values in the x blocks owned by T1 that are needed by T2. All the between-thread messages are thus condensed and consolidated. After this preparation step, the following private arrays are created on each thread:

    int *mythread_num_send_values, *mythread_num_recv_values;
    int **mythread_send_value_list, **mythread_recv_value_list;
    double **mythread_send_buffers;

All the above private arrays have length THREADS (in the leading direction). If mythread_num_send_values[T] > 0, it means that MYTHREAD needs to pack an outgoing message of this length for thread T as the receiver. Correspondingly, mythread_send_value_list[T] points to a list of local indices relative to a pointer-to-local, which is cast from &x[MYTHREAD*BLOCKSIZE], so that the respective needed x values can be efficiently extracted and packed together as the outgoing message mythread_send_buffers[T] toward thread T. The one-sided communication command upc_memput, which is of the following syntax:

    void upc_memput(shared void *dst, const void *src, size_t n);

will be used to transfer each outgoing message.
The meaning of mythread_num_recv_values[T] applies to the opposite communication direction. Also, the content of mythread_recv_value_list[T] will be needed by MYTHREAD to unpack the incoming message from thread T. One particular issue is that the upc_memput function requires a pointer-to-shared available on the destination thread. To this end, we need the following shared array with a block size of THREADS, where each array element is itself a pointer-to-shared:

shared[] double* shared [THREADS] shared_recv_buffers[THREADS*THREADS];
An important task in the preparation step is to let each thread go through the following for-loop to allocate the individual buffers, in UPC's globally shared address space, for its expected incoming messages:

    for (int T=0; T<THREADS; T++)
      if (mythread_num_recv_values[T] > 0)
        shared_recv_buffers[MYTHREAD*THREADS+T] =
          (shared[] double*) upc_alloc(mythread_num_recv_values[T]*sizeof(double));

It should be noted that the standard upc_alloc function should be called by only one thread. The entire array that is allocated by upc_alloc has affinity to the calling thread while being accessible by all the other threads (see [28]). In the above for-loop, each thread (using its unique MYTHREAD value) only calls upc_alloc inside its affinity block of shared_recv_buffers.
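The per-pair index lists of the preparation step can be derived from J alone. The following Python sketch (an illustration under the block-cyclic ownership rule of Section 2, not the paper's code) builds, for one thread, the sorted unique x-indices it must receive from every other thread; the owners' send lists are the mirror image of these:

```python
def build_recv_lists(mythread, threads, my_rows, J, r_nz, blocksize):
    """For each owner thread, the sorted unique global x-indices that
    this thread needs: one condensed message per communicating pair."""
    needed = [set() for _ in range(threads)]
    for i in my_rows:
        for j in range(r_nz):
            g = J[i*r_nz + j]                   # global column index
            owner = (g // blocksize) % threads  # block-cyclic owner
            if owner != mythread:
                needed[owner].add(g)            # duplicates condensed away
    return [sorted(s) for s in needed]

# 2 threads, block size 2: thread 0's rows reference columns 2 and 3,
# both owned by thread 1 -> one message of length 2 from thread 1.
assert build_recv_lists(0, 2, [0, 1], [2, 3, 0, 1], 1, 2) == [[], [2, 3]]
```

Using a set per destination is what condenses repeated references to the same x value into a single entry of the message.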

Communication procedure
When the preparation step described above is done, we need to invoke a communication procedure to precede each SpMV computation. The communication procedure first lets each thread (with MYTHREAD as its unique ID) pack an outgoing message for every thread T that has mythread_num_send_values[T] > 0, by extracting the respective needed values from its owned blocks of the shared array x (cast to a pointer-to-local), using the local indices stored in mythread_send_value_list[T]. Then, the one-sided communication function upc_memput is called to send every ready-packed outgoing message to its destination thread. Thereafter, the upc_barrier command is posted to ensure that all the inter-thread communication is done, which means that all the expected messages have arrived on the respective destination threads. Finally, each thread unpacks every incoming message by copying its content to the respective positions in the thread-private array mythread_x_copy. Each thread also copies its owned blocks from the shared array x to the corresponding positions in the thread-private mythread_x_copy. The entire communication procedure can be seen in Listing 5.

Implementation
By incorporating the above preparation step and communication procedure, we can create a new UPC implementation of SpMV in Listing 5. Specifically, each pair of communicating threads exchanges only one message containing the actually needed x values. As a "price", the new version has to introduce additional data structures in the preparation step, and involve message packing and unpacking in the communication procedure.

Performance models
We consider the performance model of a parallel implementation as a formula that can theoretically estimate the run time, based on some information of the target work and some characteristic parameters of the hardware platform intended. Roughly, the time usage of a parallel program that implements a computation comprises the time spent on the computational work and the parallelization overhead. The latter is mostly spent on various forms of communication between the executing processes or threads.
The three UPC implementations shown in Section 4 carry out identical computational work. However, they differ greatly in how the between-thread communication is realized, with respect to both the frequency and volume of the between-thread data transfers. As will be demonstrated in Section 6, the time usages of the three transformed UPC implementations are very different. This motivates us to derive the corresponding performance models, with a special focus on modeling the communication cost in detail. Such theoretical performance models will help us to understand the actual computing speed achieved, while also providing hints on further performance tuning.

Time spent on computation
Due to a fixed number of nonzeros per matrix row, the amount of floating-point operations per thread is linearly proportional to the number of y(i) values that are designated to each thread to compute. For all the UPC implementations in this paper, the shared array y is distributed in a block-cyclic manner, with a programmer-prescribed block size of BLOCKSIZE. Recall that the array y is of length n, thus the number of y blocks assigned per thread, B^comp_thread, is given by the following formula:

    B^comp_thread = nblks/THREADS + (1 if MYTHREAD < (nblks mod THREADS), else 0),  with nblks = ⌈n/BLOCKSIZE⌉.   (5)

Due to a low ratio between the number of floating-point operations and the induced amount of data movement in the memory hierarchy, the cost of computation for our SpMV example is determined by the latter, as suggested by the well-known Roofline model [29]. Our strategy is to derive the minimum amount of data movement needed between the main memory and the last-level cache. More specifically, the following formula gives the minimum data traffic (in bytes) from/to the main memory for computing each y(i) value:

    r_nz · (sizeof(double) + sizeof(int)) + 3 · sizeof(double).   (6)

Here, r_nz denotes the fixed number of off-diagonal nonzero values per matrix row, each occupying sizeof(double) bytes in memory, with sizeof(int) bytes needed per column index. The last term in (6) corresponds to the two memory loads for accessing loc_D[k] and mythread_x_copy[offset+k] (or x[offset+k]) and the memory store associated with updating loc_y[k]. We refer to Listings 3-5 for the implementation details. Formula (6) has assumed perfect data reuse in the last-level data cache. Our earlier experiences with the same SpMV computation (implemented in sequential C or OpenMP), for the case of a "proper" ordering of the matrix rows (see e.g. [18]), suggest that (6) is a realistic estimate for the last two UPC implementations (Listings 4 and 5). For these two implementations, the x values are fetched from the thread-private array mythread_x_copy.
In the first transformed UPC implementation (Listing 3), indirectly indexed accesses to the shared array x (of form x[loc_J[k*r_nz+j]]) will incur additional memory traffic on "remote" threads, caused by the inevitable between-thread data transfers. We have chosen for this case to model the deviation from (6) as a part of the communication cost, to be discussed in Section 5.2.3. Therefore, the minimum computational time needed per thread, which is the same for all the UPC implementations of this paper, can be estimated as

    T^comp_thread = B^comp_thread · BLOCKSIZE · (r_nz · (sizeof(double) + sizeof(int)) + 3 · sizeof(double)) / W^private_thread,   (7)

where W^private_thread denotes the realistic bandwidth (bytes per second) at which a thread can access its private memory space. This can be found by running a multi-threaded STREAM benchmark (see [21]) on one compute node of a target hardware platform, using the intended number of UPC threads per node. The W^private_thread value equals the measured multi-threaded STREAM bandwidth divided by the number of threads used. Note that the bandwidth measured by a single-threaded STREAM benchmark cannot be used directly as W^private_thread, unless a single UPC thread per compute node is indeed intended. This is because the multi-threaded STREAM bandwidth is not linearly proportional to the number of threads used, due to saturation of the memory bandwidth.
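As a numerical illustration of the per-row traffic estimate (6) and the resulting compute-time estimate, the sketch below plugs in a made-up per-thread bandwidth (10 GB/s is an assumption for illustration, not a measurement from the paper):

```python
SIZEOF_DOUBLE, SIZEOF_INT = 8, 4

def bytes_per_row(r_nz):
    """Minimum main-memory traffic per y(i), following formula (6)."""
    return r_nz * (SIZEOF_DOUBLE + SIZEOF_INT) + 3 * SIZEOF_DOUBLE

def t_comp_thread(rows_per_thread, r_nz, w_private_thread):
    """Per-thread compute-time estimate, with rows_per_thread
    standing for B_comp * BLOCKSIZE."""
    return rows_per_thread * bytes_per_row(r_nz) / w_private_thread

# 5 off-diagonal nonzeros per row: 5*(8+4) + 3*8 = 84 bytes per row.
assert bytes_per_row(5) == 84
```

For example, one million rows per thread at the assumed 10 GB/s gives roughly 8.4 milliseconds of memory-bound compute time.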

Definitions
Before we dive into the details of modeling the various communication costs that are associated with the three transformed UPC implementations, it is important to establish the following definitions: • If a thread accesses a memory location in the globally shared address space with affinity to another thread, a non-private memory operation is incurred.
• A non-private memory operation, which is between two threads, can belong to one of two categories: local inter-thread and remote inter-thread. The first category refers to the case where the two involved threads reside on the same compute node, which has a physically shared NUMA (or UMA) memory encompassing all the threads running on the node. The second category refers to the case where the two threads reside on two different nodes, which need to use some interconnect for exchanging data.
• A non-private memory operation, in each category, can happen in two modes: either individually or inside a sequence of memory operations accessing a contiguous segment of non-private memory.
We term the first mode as individual, the second mode as contiguous.

Cost of non-private memory operations
The time needed by one non-private memory operation, in the contiguous mode, can be estimated as

    T^local_cont = sizeof(one element) / W^local_thread,   T^remote_cont = sizeof(one element) / W^remote_node,   (8)

where W^local_thread denotes the per-thread bandwidth for contiguous local inter-thread memory operations, and we assume for simplicity W^local_thread = W^private_thread, with the latter being defined in Section 5.1. Correspondingly, W^remote_node denotes the interconnect bandwidth available to a node for contiguous remote (inter-node) memory operations. The reason for adopting a per-node bandwidth for inter-node memory operations is that the inter-node network bandwidth can typically be fully utilized by one thread, unlike the main-memory bandwidth. The value of W^remote_node can be measured by a modified UPC STREAM benchmark or simply a standard MPI ping-pong test, to be discussed in Section 6.2.
The cost of one individual remote inter-thread memory operation, T^remote_indv, is assumed to be dominated by a constant latency overhead, denoted by τ. Specifically, the latency τ is independent of the actual number of bytes involved in one individual remote memory operation. For the same reason, W^remote_node has no bearing on T^remote_indv. The actual value of τ can be measured by a special UPC micro-benchmark, to be discussed in Section 6.2. The cost of one individual local inter-thread memory operation can be estimated by the following formula:

    T^local_indv = sizeof(one cache line) / W^local_thread.   (9)

Here, we will again adopt W^local_thread = W^private_thread. The reason for having the size of one cache line as the numerator in (9) is that individual local inter-thread memory operations are considered to be non-contiguously spread in the private memory of the owner thread, thus paying the price of an entire cache line per access. (It is implied that one data element occupies fewer bytes than one cache line.)

Communication time for the first transformed UPC implementation
For the UPC implementation in Listing 3, an individual non-private memory operation arises when the owner thread of the value x[loc_J[k*r_nz+j]] is different from the accessing thread. Each such non-private memory operation costs either T^local_indv as defined in (9) or T^remote_indv = τ. To quantify the total communication time incurred per thread, we need the following two counts, which can be obtained by letting each thread examine its owned blocks of the shared array J: • C^{local,indv}_thread: the number of occurrences where &x[loc_J[k*r_nz+j]] has a different affinity than MYTHREAD and the owner thread resides on the same compute node as MYTHREAD.
• C^{remote,indv}_thread: the number of occurrences where &x[loc_J[k*r_nz+j]] has a different affinity than MYTHREAD and the owner thread resides on a different compute node.
Thus, the total communication cost per thread during each SpMV is

T^comm_thread = C^{local,indv}_thread · T^local_indv + C^{remote,indv}_thread · τ.   (10)
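The two counts can be gathered by letting each thread scan its portion of loc_J. The following plain-C sketch illustrates the idea; owner_thread and node_of are hypothetical stand-ins for UPC's upc_threadof and a thread-to-node mapping, assuming a block-cyclic layout:

```c
#include <assert.h>

/* Hypothetical affinity helpers (block-cyclic layout with BLOCKSIZE). */
#define BLOCKSIZE 4
#define THREADS 4
#define THREADS_PER_NODE 2

static int owner_thread(int global_idx) {
    return (global_idx / BLOCKSIZE) % THREADS;   /* block-cyclic affinity */
}
static int node_of(int thread) { return thread / THREADS_PER_NODE; }

/* Count individual local and remote inter-thread accesses for one thread,
   given its list of n global indices into the shared array x. */
static void count_indv(int mythread, const int *locJ, int n,
                       int *c_local, int *c_remote) {
    *c_local = *c_remote = 0;
    for (int k = 0; k < n; k++) {
        int owner = owner_thread(locJ[k]);
        if (owner == mythread) continue;          /* private access, no cost */
        if (node_of(owner) == node_of(mythread))
            (*c_local)++;                         /* same node, other thread */
        else
            (*c_remote)++;                        /* other node              */
    }
}
```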

Communication time for the second transformed UPC implementation
For the UPC implementation in Listing 4, before computing the SpMV, each thread calls the upc_memget function to transport its needed blocks of the shared array x into the private array mythread_x_copy. To estimate the communication time spent per node, we will use the following formula:

T^comm_node = 2 · B^local_thread · BLOCKSIZE · sizeof(double) / W^local_thread + Σ_{threads on node} B^remote_thread · (τ + BLOCKSIZE · sizeof(double) / W^remote_node).   (11)

Here, B^local_thread denotes the number of x blocks residing on the same node as MYTHREAD and having at least one value needed by MYTHREAD, whereas B^remote_thread denotes the number of needed blocks residing on other nodes. The reason for having a factor of 2 in the numerator of the first term on the right-hand side of (11) is that the private/local memory loads and stores both take place on the same node. Note that we have consistently assumed W^local_thread = W^private_thread. More importantly, we consider that all the threads on the same node concurrently carry out their intra-node part of the communication, whereas the inter-node operations of upc_memget are carried out one by one. For communicating each inter-node block, we have included τ as the "start-up" overhead in addition to the W^remote_node-determined cost.
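The per-node estimate (11) can be evaluated directly. In the following plain-C sketch, the bandwidths and block counts are assumed inputs; the intra-node term uses the thread's own local block count, while the inter-node term is summed over the node's threads (b_remote_sum):

```c
#include <assert.h>

#define BLOCKSIZE 65536
#define DBL 8.0                   /* sizeof(double) */

/* Evaluation of the per-node communication estimate (11): intra-node block
   copies run concurrently at per-thread bandwidth, while inter-node
   upc_memget calls are serialized and each pays a tau start-up. */
static double t_comm_node(int b_local, long b_remote_sum,
                          double w_local_thread, double w_remote_node,
                          double tau) {
    double intra = 2.0 * b_local * BLOCKSIZE * DBL / w_local_thread;
    double inter = b_remote_sum * (tau + BLOCKSIZE * DBL / w_remote_node);
    return intra + inter;
}
```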

Communication time for the third transformed UPC implementation
For the UPC implementation in Listing 5, the overhead per thread for preparing the private array mythread_x_copy before the SpMV has four parts: (1) packing all its outgoing messages; (2) calling upc_memput for each outgoing message; (3) copying its own blocks of x to the corresponding positions in mythread_x_copy; (4) unpacking the incoming messages. Let us denote by S^{local,out}_thread the accumulated size (in bytes) of the outgoing messages from MYTHREAD to threads residing on the same node as MYTHREAD, and by S^{remote,out}_thread the accumulated size of the outgoing messages towards other nodes. Similarly, S^{local,in}_thread and S^{remote,in}_thread denote the incoming counterparts. Then, the per-thread overhead of packing the outgoing messages is

T^pack_thread = (S^{local,out}_thread + S^{remote,out}_thread) / sizeof(double) · (2·sizeof(double) + sizeof(int)) / W^private_thread.   (12)

We remark that packing each value in an outgoing message requires loading at least sizeof(double) + sizeof(int) bytes from the private memory, and storing sizeof(double) bytes into the message.
The per-thread overhead of the upc_memput calls is

T^put_thread = 2 · S^{local,out}_thread / W^local_thread + C^{remote,out}_thread · τ + S^{remote,out}_thread / W^remote_node,   (13)

where C^{remote,out}_thread denotes the number of outgoing inter-node messages from MYTHREAD. Again, for each inter-node message, we have included τ as the "start-up" overhead in addition to the W^remote_node-determined cost.
The per-thread overhead of copying the private blocks of x into mythread_x_copy is

T^copy_thread = 2 · B^comp_thread · BLOCKSIZE · sizeof(double) / W^private_thread,   (14)

where we recall that B^comp_thread is defined in (5). Finally, the per-thread overhead of unpacking the incoming messages is

T^unpack_thread = (S^{local,in}_thread + S^{remote,in}_thread) / sizeof(double) · (sizeof(double) + sizeof(int) + sizeof(cache line)) / W^private_thread.   (15)

Note that sizeof(double) + sizeof(int) corresponds to contiguously reading each value from an incoming message, whereas sizeof(cache line) corresponds to the cost of writing the value to a non-contiguous location in the array mythread_x_copy.
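The pack and unpack steps whose costs are modeled above can be sketched in plain C as follows; send_index and recv_index are hypothetical per-destination index lists, not identifiers from the actual implementation:

```c
#include <assert.h>

/* Pack the x-values listed in send_index into a contiguous outgoing
   message: each value costs one double load reached via one int index,
   as counted in the packing formula. */
static void pack(const double *x, const int *send_index, int cnt,
                 double *msg) {
    for (int i = 0; i < cnt; i++)
        msg[i] = x[send_index[i]];
}

/* Unpack an incoming message into non-contiguous positions of
   mythread_x_copy: each store may touch a whole cache line, as counted
   in the unpacking formula. */
static void unpack(const double *msg, const int *recv_index, int cnt,
                   double *x_copy) {
    for (int i = 0; i < cnt; i++)
        x_copy[recv_index[i]] = msg[i];
}
```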

Total time usage
Due to the possible imbalance of both computational work and communication overhead among the threads, the total time usage of any of the UPC implementations will be determined by the slowest thread or node. For the first transformed UPC implementation, shown in Listing 3, the total time is determined by the slowest thread:

T_UPCv1 = max_threads (T^comp_thread + T^comm_thread).   (16)

For the second transformed UPC implementation, shown in Listing 4, the total time is determined by the slowest node:

T_UPCv2 = max_nodes (T^comp_node + T^comm_node).   (17)

For the third transformed UPC implementation, shown in Listing 5, due to the needed explicit barrier after the upc_memput calls, the total time usage is modeled as

T_UPCv3 = max_threads (T^pack_thread + T^put_thread) + max_threads (T^copy_thread + T^unpack_thread + T^comp_thread).   (18)
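As a minimal illustration of the "slowest thread determines the total" rule, consider the following plain-C sketch; the arrays of per-thread times are assumed inputs:

```c
#include <assert.h>

/* Total-time rule for the per-thread model: the maximum over all threads
   of the per-thread computation plus communication time. */
static double t_total(const double *t_comp, const double *t_comm, int n) {
    double tmax = 0.0;
    for (int i = 0; i < n; i++) {
        double t = t_comp[i] + t_comm[i];
        if (t > tmax) tmax = t;
    }
    return tmax;
}
```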

Remarks
It is important to separate two types of information needed by the above performance models. The hardware-specific information includes W^private_thread, W^remote_node, τ and the cache line size of the last-level cache. The first parameter denotes the per-thread rate of contiguously accessing private memory locations. The second parameter is the per-node counterpart for contiguously accessing remote off-node memory locations. Note that we do not distinguish between W^private_thread and the intra-socket or inter-socket local memory bandwidths, due to the very small differences between them. The τ parameter describes the latency of an individual remote memory access. All the hardware parameters are easily measurable by simple benchmarks, see Section 6.2, or known from the hardware specification.
The computation-specific information includes C^{local,indv}_thread and C^{remote,indv}_thread for the first transformed implementation, B^local_thread and B^remote_thread for the second, and the message count C^{remote,out}_thread and the accumulated message sizes, together with B^comp_thread, for the third. All of these can be counted exactly once the sparsity pattern of M, the number of threads and the value of BLOCKSIZE are fixed.

Experiments
To study the impact of various code transformations described in Section 4 and to validate the corresponding performance models proposed in Section 5, we will use a real-world case of SpMV in this section.

A 3D diffusion equation solver based on SpMV
One particular application of SpMV can be found in numerically solving a 3D diffusion equation that is posed on an irregular domain. Typically, an unstructured computational mesh must be used to match the irregular domain. All numerical strategies will involve a time integration process. During time step ℓ, the simplest numerical strategy takes the form of an SpMV:

v^ℓ = M v^{ℓ−1},

where the vectors v^ℓ and v^{ℓ−1} denote the numerical solutions on two consecutive time levels, each containing approximate values on some mesh entities (e.g. the centers of all tetrahedrons). The matrix M arises from a numerical discretization of the original diffusion equation. M is normally time-independent and is thus computed once and for all, prior to the time integration process. The unstructured computational mesh will lead to an irregular spread of the nonzeros. In particular, if a second-order finite volume discretization is applied to a tetrahedral mesh, the number of off-diagonal nonzero values per row of M is up to 16, see e.g. [17].
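For reference, one such time step can be sketched sequentially in C, using an ELLPACK-style storage with a separate diagonal array; the array names D, A and J mirror those used in the UPC listings, while the padding convention is an assumption:

```c
#include <assert.h>

/* One SpMV-based time step: y = M x, where row k of M has a diagonal
   entry D[k] and up to r_nz off-diagonal entries A[k*r_nz+j] located in
   columns J[k*r_nz+j] (unused slots assumed padded with A[...] = 0). */
static void spmv_step(int n, int r_nz, const double *D, const double *A,
                      const int *J, const double *x, double *y) {
    for (int k = 0; k < n; k++) {
        double sum = D[k] * x[k];
        for (int j = 0; j < r_nz; j++)
            sum += A[k * r_nz + j] * x[J[k * r_nz + j]];  /* indirect access */
        y[k] = sum;
    }
}
```

The indirect access x[J[...]] in the inner loop is exactly the operation that, in a parallel setting, induces the fine-grained irregular communication studied in this paper.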
Three test problems of increasing resolution will be used in this section. They all arise from modeling the left cardiac ventricle of a healthy male human. (The 3D diffusion solver can be an integral part of a heart simulator.) The three corresponding tetrahedral meshes are generated by the open-source TetGen software [26], with the actual sizes of the meshes listed in Table 1. Note that we have r_nz = 16 for all three test problems. The tetrahedrons have been re-ordered in each mesh to achieve good cache behavior for a straightforward sequential computation. It is important to notice that all three meshes are fixed for the following UPC computations, independent of the number of UPC threads used and the value of BLOCKSIZE chosen. For any computer program implementing the 3D diffusion solver, two arrays are sufficient for containing the two consecutive numerical solutions v^ℓ and v^{ℓ−1}. For the UPC implementations discussed in Section 4, the shared array y corresponds to v^ℓ and x to v^{ℓ−1} during each time step. The pointers-to-shared y and x need to be swapped before the next time step, fenced between a pair of upc_barrier calls.

Hardware and software platforms
The Abel computer cluster [1] was used to run all the UPC codes and measure their time usage. Each compute node on Abel is equipped with two Intel Xeon E5-2670 2.6 GHz 8-core CPUs and 64 GB of RAM. The interconnect between the nodes is FDR InfiniBand (56 Gbit/s). With a multi-threaded STREAM [21] micro-benchmark in C, we measured the aggregate memory bandwidth per node as 75 GB/s using 16 threads. This gave W^private_thread = 75/16 ≈ 4.7 GB/s per thread.

In order to measure the cost of an individual remote memory transfer, τ (defined in Section 5.2.2), we developed the micro-benchmark shown in Listing 6. Specifically, v is a shared UPC array created by upc_all_alloc. Each thread then randomly reads entries of v that have affinity with "remote threads", through a thread-private index array mythread_indices. The total time usage, subtracting the time needed to contiguously traverse mythread_indices, can be used to quantify τ. When using two nodes each running 8 UPC threads, we measured the value of τ as 3.4 µs. Varying the number of concurrent threads does not change the measured value of τ very much.

    int nblks = n/BLOCKSIZE + ((n%BLOCKSIZE) ? 1 : 0);
    int mythread_nblks = nblks/THREADS + ((MYTHREAD < (nblks%THREADS)) ? 1 : 0);
    shared [BLOCKSIZE] double *v =
        upc_all_alloc(nblks, BLOCKSIZE*sizeof(double));
    double tmp;
    int *mythread_indices =
        (int*) malloc(mythread_nblks*BLOCKSIZE*sizeof(int));
    /* let array 'mythread_indices' contain random global
       indices with affinity to "remote threads" */
    randomize(mythread_indices, mythread_nblks*BLOCKSIZE);
    /* start timing ... */
    for (int mb = 0, i = 0; mb < mythread_nblks; mb++)
        for (int k = 0; k < BLOCKSIZE; k++, i++)
            tmp = v[mythread_indices[i]];
    /* stop timing ... */

Listing 6: A UPC micro-benchmark for measuring the latency of individual remote memory transfers

Table 2 compares the performance of the naive UPC implementation (Listing 2) against that of the first transformed UPC implementation (Listing 3).
Here, we only used one compute node on Abel while varying the number of UPC threads. Each experiment was repeated several times, and the best time measurement is shown in Table 2. Thread binding was always used, as for all the subsequent experiments. Test problem 1 (with 6810586 tetrahedrons) was chosen, with the value of BLOCKSIZE fixed at 65536. We can clearly see from Table 2 that the naive UPC implementation is very inefficient, due to using upc_forall and accessing y, D, A and J through pointers-to-shared. Table 3 summarizes the time measurements for all three transformed UPC implementations (denoted by UPCv1, UPCv2 and UPCv3) and all three test problems. It can be seen that UPCv3 has the best performance as expected, followed by UPCv2, with UPCv1 being the slowest. The only exception is that UPCv1 is faster than UPCv2 when running 16 UPC threads on one Abel node. This is because there is no "penalty" of individual remote memory transfers for UPCv1 in such a scenario, whereas UPCv2 has to transfer all the needed blocks in their entirety.

Validating the performance models
We have seen in Section 6.3 that the three transformed implementations have quite different performance behaviors. To shed some light on the causes of the performance differences, we will now use the hardware characteristic parameters obtained in Section 6.2 together with the performance models proposed in Section 5. Specifically, Table 4 compares the actual time measurements against the predicted time usages for Test problem 1 on the Abel cluster. It can be seen that the predictions made by the performance models of Section 5 follow the same trends as the actual time measurements, except for the case of UPCv1 using 128 threads. It is worth noticing that the single-node performance (16 threads) of UPCv2 is correctly predicted to be slower than that of UPCv1, whereas the reversal of this performance relationship when using multiple nodes is also confirmed by the predictions. For small thread counts, the prediction accuracy is quite good. For larger thread counts, the predictions become less accurate. For UPCv1, there are four cases where the actual run times are faster than the predictions. These are attributed to the fact that the adopted τ value of 3.4 µs can be a little "pessimistic". Recall from Section 6.2 that this particular τ value was measured by the micro-benchmark when it used 8 threads on one node to simultaneously communicate with 8 other threads on another node. In reality, the effective τ value can be smaller than 3.4 µs, if the average number of remotely communicating threads per node over time is fewer than 8. For UPCv3, there are two cases where the actual run times are slightly faster than the predictions. This is due to imbalance between the threads with respect to the per-thread amount of computation and message packing/unpacking. When most of the threads have finished their tasks, each of the remaining threads has access to an effective W^private_thread value that is larger than 1/16 of the per-node memory bandwidth.
This can result in the time prediction of UPCv3 being a little "pessimistic". To examine some of the prediction details of UPCv3, we show in Figure 1 the per-thread measurements and predictions of T^comp_thread, T^unpack_thread and T^pack_thread (see Section 5.2.5), for the particular case of using 32 threads on two nodes. It can be seen that the predictions of the three time components closely match the actual time usages.
As mentioned in Section 4, the three UPC implementations differ in how the inter-thread communications are handled. To clearly show the difference, we have plotted in the top panel of Figure 2 the per-thread distribution of communication volumes for the specific case of using 32 threads with BLOCKSIZE set to 65536. We observe that UPCv3 has the lowest communication volume, whereas UPCv2 has the highest. Although UPCv1 induces lower communication volumes than UPCv2, all communications of the former are individual and thus more costly. It is also observed that the communication volumes can vary considerably from thread to thread. The specific variation depends on the spread of the nonzeros, as well as on the number of threads used and the value of BLOCKSIZE chosen. The dependency on the last factor is exemplified in the bottom panel of Figure 2. This shows that tuning BLOCKSIZE is a viable approach for the programmer to optimize performance, and the performance models are essential in this context.

Related work
Many performance studies about UPC programming, e.g. [12,30,5,25], selected kernels from the NAS parallel benchmark (NPB) suite [4]. These studies, however, did not involve the irregular fine-grained communication that arises from indirectly indexing the elements of shared arrays. Other published non-NPB benchmarks implemented in UPC, such as those reported in [5], had the same limitation. Various UPC implementations of SpMV were studied in [14], but the authors chose to combine a row-wise block distribution of the sparse matrix with duplicating the entire source vector x on each thread. Such a UPC implementation of SpMV completely avoids the impact of irregular fine-grained communication, which is the main focus of our present paper. The authors of [19] distributed the source vector x among the UPC threads, but their UPC implementation of SpMV explicitly avoided off-node irregular fine-grained communication, for which the needed values of x were transported in batches from the off-node owner threads using e.g. upc_memget.
An extended suite of UPC STREAM micro-benchmarks, including various scenarios of using pointers-to-shared, was proposed by the authors of [30]. They also reported measurements that clearly reveal the additional overhead due to UPC language features. For our purpose of performance modeling, we only found the so-called "random shared read" micro-benchmark defined in [30] to be useful for quantifying τ. Since we have no access to the source code used for [30], we wrote our own micro-benchmark (see Section 6.2).
One approach to alleviating the various types of overhead associated with using pointers-to-shared is by specialized UPC compilers, see e.g. [7,9,8,3]. However, indirectly accessing array elements through pointers-to-shared, which induces irregular fine-grained communication, cannot be handled by any of the specialized UPC compilers. Manual code transformations are thus necessary, such as those proposed in Sections 4.2-4.3 of our present paper. It is quite common that performance studies (such as [5]) and performance models (such as [31]) of UPC programs are built upon "single-value statistics" that are accumulated or averaged over all threads. We see this as a non-negligible source of inaccuracy, because considerable variations in the communication volume and pattern may exist between threads, as exemplified by Figures 1 and 2. Ignoring such thread-wise imbalances in communication will lead to inaccurate performance predictions, exactly in the same way as ignoring thread-wise imbalances in computation. At the same time, the performance models proposed in Section 5 of our present paper are kept to a minimalistic style, in that we only rely on four easily benchmarked/obtained hardware characteristic parameters: W^private_thread, W^remote_node, τ and the last-level cache line size. This stands in strong contrast to complicated performance modeling approaches such as in [19].

Performance modeling for a 2D uniform-mesh computation
Our performance modeling strategy in Section 5 was originally derived for the case of fine-grained irregular communication that is due to indirectly indexed accesses to shared arrays. In this section, we will show that the same methodology, as well as the same hardware characteristic parameters, is applicable to other scenarios. As a concrete example, we will use an existing UPC implementation that solves a 2D heat diffusion equation

∂φ/∂t = ∂²φ/∂x² + ∂²φ/∂y²

on a uniform mesh. The UPC code was kindly provided by Dr. Rolf Rabenseifner at HLRS, in connection with a short course on PGAS programming [24]. The global 2D solution domain is rectangular, so the UPC threads are arranged as a 2D processing grid, with mprocs rows and nprocs columns. (Note that THREADS equals mprocs*nprocs.) Each thread is thus identified by an index pair (iproc,kproc), where iproc=MYTHREAD/nprocs and kproc=MYTHREAD%nprocs. The global 2D domain, of dimension M × N, is evenly divided among the threads. Each thread is responsible for a 2D subdomain of dimension m × n, which includes a surrounding halo layer needed for communication with the neighboring threads. The main data structure consists of the per-thread shared arrays xphi[0], ..., xphi[THREADS-1]. Note that the values in each shared array xphi[MYTHREAD] have affinity to the allocating thread, but are accessible by all the other threads. Each thread also allocates a private array named phin to store its portion of the numerical solution for a new time level, to be computed based on the numerical solution for the previous time level, which is assumed to reside in xphi[MYTHREAD].
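For illustration, the thread-to-grid mapping described above can be written as a small C helper (a sketch, not taken from the original code):

```c
#include <assert.h>

/* 2D process-grid identification: thread 'mythread' sits in row iproc
   and column kproc of the mprocs x nprocs thread grid. */
static void grid_id(int mythread, int nprocs, int *iproc, int *kproc) {
    *iproc = mythread / nprocs;
    *kproc = mythread % nprocs;
}
```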

Halo data exchange
The halo data exchange, which is needed between neighboring threads, is realized by every thread calling upc_memget on each of its four sides (where a neighboring thread exists). In the vertical direction, the values to be transferred from the upper and lower neighbors already lie contiguously in the memory of the owner threads, so there is no need to explicitly pack the messages. In the horizontal direction, however, message packing is needed before upc_memget can be invoked towards the left and right neighbors. The following additional buffers are needed for packing and unpacking the horizontal messages:

    halovec_coord1first = (double *) malloc((m-2)*sizeof(double));
    halovec_coord1last  = (double *) malloc((m-2)*sizeof(double));

The function executing the halo data exchange then invokes upc_memget once per existing neighbor. The per-thread overhead of packing and unpacking the horizontal messages is modeled as

T^{pack,2D}_thread = S^horiz_thread / sizeof(double) · (sizeof(cache line) + sizeof(double)) / W^private_thread.   (19)

Comparing (19) with (12) from Section 5.2.5, we can see that the cost due to indirect array indexing (sizeof(int)) is no longer applicable. We have also assumed in (19) that reading/writing from/to non-contiguous locations in the array phi each costs a cache line. The value of S^horiz_thread in (19) denotes the total volume of local and remote messages, per thread, to be transferred in the horizontal direction. This can be precisely calculated when the values of m and n, as well as the thread grid layout, are known. The per-thread cost of the data exchange itself is modeled as

T^{exch,2D}_thread = 2 · S^local_thread / W^local_thread + C^remote_thread · τ + S^remote_thread / W^remote_node.   (20)

Formula (20) is essentially the same as (13) from Section 5.2.5. It is commented that S^local_thread in (20) denotes the total volume of all local messages (in both horizontal and vertical directions) per thread, and likewise S^remote_thread denotes the total volume of all remote messages, with C^remote_thread denoting the number of remote messages per thread.
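The horizontal packing accounted for in (19) amounts to gathering a strided column of phi into a contiguous buffer; the following plain-C sketch illustrates this (the row-major storage convention and the function name are assumptions):

```c
#include <assert.h>

/* Gather interior column 'col' of a row-major (m x n) array phi, rows
   1..m-2 (the halo rows are excluded), into a contiguous halo vector.
   Each strided read may cost a full cache line, as assumed in (19). */
static void pack_column(const double *phi, int m, int n, int col,
                        double *halovec) {
    for (int i = 1; i < m - 1; i++)
        halovec[i - 1] = phi[i * n + col];
}
```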
Putting these together, the total time spent on halo exchange is modeled as

T^halo_2D = max_threads (T^{pack,2D}_thread + T^{exch,2D}_thread).   (21)

The time spent on computation during each time step (see Listing 8) is modeled as

T^comp_2D = γ · m · n · sizeof(double) / W^private_thread,   (22)

where γ counts the minimum number of per-point loads and stores between the memory and the last-level cache, see e.g. [27]. Here, we have assumed that every thread has exactly the same amount of computational work. Table 5 shows a comparison between the actual time usages of the 2D UPC solver and the predicted values of T^halo_2D and T^comp_2D. We have used the same hardware characteristic parameters as in Table 4. It can be seen that the predictions of T^comp_2D agree very well with the actual measurements, while the prediction accuracy of T^halo_2D is on average 72%.

Conclusion
Our starting point is a naive UPC implementation of the SpMV y = Mx in Section 3.2. This naive implementation excessively uses shared arrays, pointers-to-shared and upc_forall. We have developed three increasingly aggressive code transformations in Section 4 aiming at performance enhancement. The transformations include explicit thread privatization, which avoids upc_forall and casts pointers-to-shared to pointers-to-local whenever possible, as well as removing the fine-grained irregular communications that are implicitly caused by indirectly indexed accesses to the shared array x. The latter transformation is realized by letting each thread adopt a private mythread_x_copy array that is prepared with explicit one-sided communications (using two different strategies) prior to the SpMV computation. Numerical experiments with a realistic application of SpMV and the associated time measurements reported in Section 6 have demonstrated the performance benefits of the code transformations. The performance benefits are also justified and quantified by the three performance models proposed in Section 5.
While the code transformations lead to improved performance, the complexity of UPC programming is increased at the same time. (Trading programmability for performance is by no means specific to our SpMV example; the UPC textbook [13] has ample examples.) The naive UPC implementation in Section 3.2 shows the easy programmability of UPC, which is fully comparable with OpenMP, as discussed in e.g. [20]. The first code transformation, in the form of explicit thread privatization shown in Section 4.1, may be done by automated code translation. The second and third code transformations, see Sections 4.2-4.3, are however more involved. The adoption of one mythread_x_copy array per thread also increases the memory footprint. Despite the reduced programmability, all the UPC implementations maintain some advantages over OpenMP in targeting distributed-shared memory systems and promoting data locality. The third code transformation, UPCv3, results in a programming style quite similar to that of MPI. Nevertheless, UPCv3 is easier to code than MPI, because global indices are retained for accessing the entries of the array mythread_x_copy. An MPI counterpart, where all arrays are explicitly partitioned among the processes, would have to map the global indices to local indices. Moreover, one-sided explicit communications via the UPC functions upc_memget and upc_memput are easier to use. A performance advantage of UPC's one-sided communication over its MPI counterpart has also been reported in e.g. [16]. On the other hand, persistent advantages of MPI over UPC include better data locality and more flexible data partitionings. A comparison of performance and programmability between UPC and MPI was given in [23] for a realistic fluid dynamics implementation. For a general comparison between OpenMP, UPC and MPI programming, we refer to [25].
It should be stressed that the SpMV computation is chosen for this paper as an illustrative example of fine-grained irregular communication that may arise in connection with naive UPC programming. The focus is not on the SpMV itself, but on the code transformations and the performance models in general. Moreover, our performance models are to a great extent independent of the UPC programming details, focusing instead on the incurred communication style, volume and frequency. The hardware characteristic parameters W^private_thread, W^remote_node and the cache line size are equally applicable to similar communications and memory-bandwidth-bound computations implemented by programming models other than UPC. Even the latency of individual remote memory accesses, τ, can alternatively be measured by a standard MPI ping-pong benchmark. Our philosophy is to represent a target hardware system by only four characteristic parameters, whereas the accuracy of the performance prediction relies on an accurate counting of the incurred communication volumes and frequencies. Accurate counting is essential and thus cannot be generalized, because different combinations of the problem size, the number of threads and the block size will almost certainly lead to different levels of performance for the same parallel implementation.

Conflicts of interest:
The authors declare that there is no conflict of interest regarding the publication of this paper.