1 Introduction

Energy consumption has become an important concern in large-scale computer systems, such as servers and clusters for grid, cloud and high performance computing (HPC). Designing energy-efficient applications in the context of large-scale HPC is thus an unavoidable challenge. In this paper, we investigate the energy footprint of a fundamental, memory-bound kernel: the parallel prefix-sums (or parallel scan), a key building block in the implementation of many parallel algorithms.

Due to its low arithmetic intensity, the prefix-sums kernel incurs a large number of CPU stalls, making it a prime candidate for investigating and optimizing energy behavior. We therefore introduce the CPPS algorithm, together with a variety of optimized sequential prefix-sums kernels and thread placement policies that serve as CPPS's building blocks for tuning performance and energy efficiency. Thanks to its better memory bandwidth utilization, CPPS outperforms other publicly available implementations [1, 10, 12, 16].

To characterize the energy and performance behavior of CPPS, we investigate several configurations, i.e., sequential prefix-sums kernels, CPU frequency levels, numbers of threads and thread placement policies. The most important results of our experiments are: (1) configuring CPPS with an optimized rather than a non-optimized sequential prefix-sums kernel yields significant energy savings, from 24 % to 55 %, across various CPU frequencies and numbers of threads, as well as a better energy reduction rate when descending to lower CPU frequencies; (2) the optimized sequential prefix-sums kernels improve CPPS's performance by 24 % to 54 % over a non-optimized sequential kernel when running at the maximum CPU frequency; (3) thread placement across NUMA nodes plays a significant role: placing threads under the same socket reduces CPPS's energy consumption by up to 47.5 % with almost negligible performance degradation, and by up to 56.73 % when both the sequential kernel and the thread placement policy are chosen together.

Notation. Henceforth, we refer to a sequential prefix-sums kernel as SPK, to an optimized sequential prefix-sums kernel as OSPK, and to a thread placement policy as TPP.

The rest of this paper is organized as follows. Section 2 introduces CPPS and presents different implementations for its building blocks. Section 3 describes assumptions and issues concerning energy and performance. Section 4 shows the experimental results. Section 5 concludes the paper.

2 Implementation

The operation of computing all prefix-sums (partial sums) for an array, also called a scan, is a fundamental building block of parallel algorithms [2, 9, 13]. Given an input array x of n elements, the problem is to compute the (inclusive) prefix-sums \(\oplus _{j=0}^ix[j]\), for all indices i, \(0\le i<n\), for a given, associative operation \(\oplus \). In the following, the prefix-sums are computed in-place, with the ith prefix-sum stored in x[i]. The parallel prefix-sums problem is well studied, and many theoretical (PRAM) and practical ideas have been proposed over the past 30 years. Many implementations for x86 systems and GPUs are publicly available [1, 6, 8, 10–12, 15, 16], but none of them address energy aspects.

2.1 CPPS Algorithm

We introduce the CPPS (cache-aware parallel prefix-sums) algorithm, shown in Algorithm 1. It is built on Pthreads for x86 shared-memory systems; we chose Pthreads over other parallel frameworks, such as OpenMP, mainly for its flexibility in thread manipulation. The algorithm runs over an array x of n elements with a predetermined associative operator OP. CPPS consists of three phases in which the array x is divided into chunks of size \(Achunk \times p + Bchunk\) and processed in parallel by p threads, where Achunk is a segment of elements assigned to every thread (first phase) and Bchunk is an additional segment assigned only to thread \(t=0\) (third phase). The size of Achunk is architecture specific, since it is mostly determined by the size of the LLC (last-level cache), the main memory latency and the number of active cores per socket. Achunk must be smaller than the system's LLC but larger than the next cache level (e.g., L2): with chunks of L2 or L1 size, each CPPS cycle (all phases) completes faster, so the threads are exposed to the main memory latency more frequently. The size of Bchunk, in turn, depends on the size of Achunk. In this paper, we do not cover how to find the best values for these parameters.

Both the first phase (lines 6–7) and the second phase (lines 9–12) are common to all threads, each working on a distinct Achunk. In the last phase (lines 15–21), thread \(t=0\) runs a sequential prefix-sums (line 17) on the Bchunk segment while the other threads still work on their own segments. Intuitively, CPPS improves the parallelism of the last phase by assigning thread \(t=0\) an additional segment.

In more detail, in the first phase each thread is assigned an Achunk segment, on which it invokes a sequential prefix-sums, called seq_kernel, starting at index s, the beginning of its Achunk, for nelems elements. In Section 2.2, we discuss the different SPKs that can be applied here. In the second phase, each thread writes the last computed element of its own Achunk to the shared array last_elems and synchronizes with the others at the barrier point. After exiting the barrier, each thread performs a partial reduction on last_elems from 0 to t; only thread \(t=0\) reduces the entire array. In the last phase, thread \(t=0\) is assigned the Bchunk and runs seq_kernel on it, while the other threads propagate the result of phase 2 to their elements.

[Algorithm 1: the CPPS algorithm]
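To make the three phases concrete, the following C/Pthreads sketch shows one CPPS round for OP = addition on 32-bit integers. It is a minimal reconstruction, not the paper's exact code: the names P, ACHUNK, BCHUNK and seq_kernel are illustrative, the chunk sizes are arbitrary placeholders, and the cycling over arrays larger than \(Achunk \times p + Bchunk\) is elided.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define P      8           /* number of threads (placeholder) */
#define ACHUNK (1 << 16)   /* per-thread segment; should fit in the LLC */
#define BCHUNK (1 << 14)   /* extra segment scanned by thread 0 in phase 3 */

static int32_t *x;                 /* input/output array */
static int32_t last_elems[P];      /* last element of each thread's Achunk */
static pthread_barrier_t bar;

/* Any SPK/OSPK can be plugged in here; this is trivial_seq with OP = +. */
static void seq_kernel(int32_t *a, size_t nelems) {
    for (size_t i = 1; i < nelems; i++) a[i] += a[i - 1];
}

static void *cpps_round(void *arg) {
    int t = (int)(intptr_t)arg;
    int32_t *s = x + (size_t)t * ACHUNK;  /* start of this thread's Achunk */

    /* Phase 1: independent sequential prefix-sums on each Achunk. */
    seq_kernel(s, ACHUNK);

    /* Phase 2: publish the last element, synchronize, partial reduction. */
    last_elems[t] = s[ACHUNK - 1];
    pthread_barrier_wait(&bar);
    int32_t offset = 0;
    for (int i = 0; i < t; i++) offset += last_elems[i];

    if (t == 0) {
        /* Phase 3a: thread 0 reduces the whole array and scans the Bchunk. */
        int32_t total = 0;
        for (int i = 0; i < P; i++) total += last_elems[i];
        int32_t *b = x + (size_t)P * ACHUNK;
        b[0] += total;               /* carry the global prefix into Bchunk */
        seq_kernel(b, BCHUNK);
    } else {
        /* Phase 3b: the other threads propagate their offsets. */
        for (size_t i = 0; i < ACHUNK; i++) s[i] += offset;
    }
    return NULL;
}

int main(void) {
    size_t n = (size_t)P * ACHUNK + BCHUNK;
    x = malloc(n * sizeof *x);
    for (size_t i = 0; i < n; i++) x[i] = 1;   /* expect x[i] == i + 1 */
    pthread_barrier_init(&bar, NULL, P);
    pthread_t th[P];
    for (int t = 0; t < P; t++)
        pthread_create(&th[t], NULL, cpps_round, (void *)(intptr_t)t);
    for (int t = 0; t < P; t++) pthread_join(th[t], NULL);
    pthread_barrier_destroy(&bar);
    free(x);
    return 0;
}
```

Compiled with, e.g., gcc -O3 -pthread, the sketch computes the inclusive scan of an all-ones array, so x[i] holds i + 1 afterwards.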

2.2 Sequential Prefix-Sums Kernels

One of the building blocks of CPPS, and of the other publicly available algorithms, is an SPK. As far as we know, the other public implementations use the trivial SPK (Algorithm 2); the one used in Intel's SHOC benchmark suite [8] is an exception, since it has been designed to exploit the wide vector registers of Intel's Xeon Phi accelerator. We show empirically later that trivial_seq is not the most performance/energy-efficient SPK, mostly due to its data dependencies. For that reason, we have identified and implemented different categories of SPKs, based on the ideas of Chatterjee et al. [4] for register-based vector computers, that leverage several architectural features such as vectorization and hardware prefetching. Their main goals are: (1) maximizing instruction parallelism through vectorization and (2) improving pipelining by breaking data dependencies. The following categories consist of algorithms working on x, an array of n elements, with an associative operator OP.

[Algorithm 2: trivial_seq, the trivial sequential prefix-sums kernel]
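For concreteness, trivial_seq amounts to the following loop (here instantiated with OP = addition on 32-bit integers); every iteration depends on the result of the previous one, which is exactly the serial chain the optimized kernels break.

```c
#include <stddef.h>
#include <stdint.h>

/* trivial_seq: in-place inclusive prefix-sums with OP = +.
 * x[i] depends on x[i-1], so the additions form one serial chain. */
static void trivial_seq(int32_t *x, size_t n) {
    for (size_t i = 1; i < n; i++)
        x[i] = x[i - 1] + x[i];
}
```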

The first category comprises two template algorithms, the vectorized seq1 (Algorithm 3) and the non-vectorized seq2 (Algorithm 4); both reuse data by chunking. In order to break the data dependencies, we exploit the associativity of the prefix-sums operation: independent prefix-sums operations are executed on different parts of x. To this end, x is split into chunks of size seg (line 2). The seg is determined at compile time by setting values for V, V2, A2 and A, where V corresponds to a vector register size (e.g., SSE width, AVX width) and A to a multiple of V. V2 and A2 are not linked to vector features but to an alternative way of segmenting and chunking x. These templates can produce many different versions by varying the above parameters; different configurations yield different performance.

[Algorithm 3: the vectorized template seq1]

Both algorithms unfold in three phases; the first and third phases eliminate the data dependency. In the first phase (lines 5–9 and 5–8, respectively), an intermediate buffer, called sum, stores the reduction (sum, multiplication, etc.) of the elements of each processed chunk; the reduction of the ith chunk is stored in sum[i]. In every iteration, the algorithms update a different cell of sum. In the second phase (lines 11 and 10–11, respectively), they run an inclusive prefix-sums on sum. In the final phase (lines 13–14 and 13–16, respectively), they use sum[i] to update the elements of the \(i+1\)th chunk. The two templates differ in several ways. For instance, seq1 performs extra copies (buffer temp) in order to vectorize, and seq2 runs an inclusive prefix-sums in the last phase, in contrast to seq1, which propagates the elements of sum to the chunks.

The vectorized instructions of seq1 yield no performance gain over the other OSPKs, due to the limited amount of computation performed on the loaded data, but seq1 is expected to dissipate less energy thanks to its use of single instruction multiple data (SIMD). SIMD can increase the performance per Watt in many applications because fewer instructions need to be decoded, fetched, issued, etc., resulting in fewer instruction cache misses and less pipeline pressure.
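seq1's actual vector code is not reproduced here; as an illustration of the kind of in-register SIMD step such kernels can build on, the following standard SSE2 sketch (our example, not the paper's code) computes an inclusive scan of four packed 32-bit integers with two shift-and-add steps.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* In-register inclusive prefix-sums of four packed 32-bit integers:
 * after the two shift-and-add steps, lane i holds x[0] + ... + x[i]. */
static inline __m128i scan4_epi32(__m128i v) {
    v = _mm_add_epi32(v, _mm_slli_si128(v, 4));  /* add lane one slot below */
    v = _mm_add_epi32(v, _mm_slli_si128(v, 8));  /* add lane two slots below */
    return v;
}
```

A seq1-style kernel would apply such a step per vector-sized chunk and combine the per-chunk totals as described above.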

[Algorithm 4: the non-vectorized template seq2]
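The exact seq2 template is parameterized by A2 and V2; the following simplified C sketch is our reconstruction with a single fixed chunk size SEG and OP = addition. Phase 1 interleaves the per-chunk reductions so that consecutive additions hit different accumulators, phase 2 scans the per-chunk totals, and phase 3 runs an inclusive scan per chunk seeded with the carried-in prefix.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SEG 8   /* chunk size; stands in for the A2/V2 parameters (assumption) */

/* Simplified seq2-style kernel; n is assumed to be a multiple of SEG. */
static void seq2_like(int32_t *x, size_t n) {
    size_t nchunks = n / SEG;
    int32_t *sum = calloc(nchunks, sizeof *sum);

    /* Phase 1: per-chunk reductions. Looping over chunks in the inner loop
     * keeps consecutive additions independent, unlike trivial_seq. */
    for (size_t j = 0; j < SEG; j++)
        for (size_t c = 0; c < nchunks; c++)
            sum[c] += x[c * SEG + j];

    /* Phase 2: inclusive prefix-sums over the per-chunk totals. */
    for (size_t c = 1; c < nchunks; c++) sum[c] += sum[c - 1];

    /* Phase 3: inclusive scan of every chunk, seeded with the total of all
     * preceding chunks (sum[c-1]); the chunks are mutually independent. */
    for (size_t c = 0; c < nchunks; c++) {
        x[c * SEG] += (c == 0) ? 0 : sum[c - 1];
        for (size_t j = 1; j < SEG; j++)
            x[c * SEG + j] += x[c * SEG + j - 1];
    }
    free(sum);
}
```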

The second category, seq3, comprises algorithms based on the template described by Algorithm 5. seq3 is an alteration of the trivial_seq algorithm that transforms its loop into a two-level nested loop with varying loop unrolling depth for the inner loop. The outer loop uses \(k\times V\) as its stride, which is defined at compile time: k corresponds to the number of iterations of the inner loop and V to the size of each of them. We do not provide results on the best choice of k and V; they are architecture-specific parameters that assist the compiler in producing better optimizations (loop unrolling). We annotate as seq3_k:d the algorithms with loop unrolling depth d equal to 1, 2, ..., k.

[Algorithm 5: the template seq3]
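As a hedged reconstruction of the seq3 template (the exact code in Algorithm 5 is not recovered here), the following C sketch rewrites trivial_seq as a two-level nested loop whose inner loops have compile-time trip counts K and V, giving the compiler an easy unrolling target; n is assumed to be a multiple of \(K \times V\), and the specific values of K and V are placeholders.

```c
#include <stddef.h>
#include <stdint.h>

#define K 2   /* number of inner blocks per outer iteration (assumption) */
#define V 8   /* elements per inner block; outer stride is K*V (assumption) */

/* Simplified seq3-style kernel with OP = +: same running-prefix computation
 * as trivial_seq, but the fixed-trip-count inner loops are easily unrolled. */
static void seq3_like(int32_t *x, size_t n) {
    int32_t carry = 0;                     /* running prefix across blocks */
    for (size_t i = 0; i < n; i += (size_t)K * V)  /* outer stride K*V */
        for (size_t k = 0; k < K; k++)
            for (size_t v = 0; v < V; v++) {
                carry += x[i + k * V + v];
                x[i + k * V + v] = carry;
            }
}
```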

2.3 Thread Placement Policies

The memory bandwidth of modern processors saturates once a certain number of cores fetch data in parallel; this number varies across processors. Since CPPS is a memory-bandwidth-intensive algorithm, crossing the saturation point explains its poor scalability. More bandwidth can be acquired on NUMA architectures, since these systems exhibit linear memory bandwidth scalability as NUMA nodes are added. Avoiding bandwidth saturation is crucial, and a common tactic for algorithms with regular memory accesses, such as CPPS, is to place threads across the NUMA nodes (memory affinity). We would therefore expect CPPS's performance to increase when spreading threads across NUMA nodes. To investigate the performance-energy trade-off, we focus on two policies, usually called Scatter and Compact: Scatter distributes threads evenly across NUMA nodes, while Compact allocates consecutive cores starting from physical core 0.
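A minimal sketch of how the two policies can be realized with pthread_setaffinity_np (in practice the physical-to-OS core mapping should come from hwloc; the numbering below, cores 0–7 on socket 0 and 8–15 on socket 1, is an assumption matching a two-socket, 8-core-per-socket machine):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define CORES_PER_SOCKET 8
#define NSOCKETS         2

/* Compact fills socket 0 first; Scatter round-robins across the sockets. */
static int core_for_thread(int t, int scatter) {
    if (!scatter) return t;                                   /* Compact */
    return (t % NSOCKETS) * CORES_PER_SOCKET + t / NSOCKETS;  /* Scatter */
}

/* Pin the calling thread (logical id t) to the core chosen by the policy. */
static void pin_self(int t, int scatter) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_for_thread(t, scatter), &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

With scatter = 1, threads 0, 1, 2, 3 land on cores 0, 8, 1, 9, i.e., alternating sockets; with scatter = 0, they fill cores 0–3 of socket 0.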

3 Discussion

Current compilers, such as gcc and Intel's icc, are unable to optimize trivial_seq, mostly because of its data dependencies. This leads to inefficient hardware pipelining and many wasted CPU cycles at maximum frequency, penalizing both performance and energy consumption. As a result, algorithms such as CPPS perform poorly and consume more energy. We expect this behavior to change when CPPS is configured with one of the proposed OSPKs.

Manipulating the frequency and analyzing its impact on performance and energy is of major importance. We are interested in a DVFS (dynamic voltage and frequency scaling) exploration of the sequential and parallel sides of prefix-sums and the correlation between them. We expect that CPPS's performance loss when descending to lower frequency levels grows at a slower rate when it is configured with an OSPK rather than trivial_seq. In the same manner, we expect greater energy savings at lower frequencies with OSPKs.

Thread placement can also influence CPPS's performance on NUMA architectures, independently of the number of threads and the SPK. One could assume that placing fewer threads per NUMA node increases the memory bandwidth per thread and eventually improves overall performance. However, with a TPP such as Scatter, the energy cost grows proportionally to the number of activated nodes. In general, it is unclear how much performance improves and which factors justify choosing a TPP other than Compact. Therefore, investigating the energy-performance trade-off of different TPPs is significant.

4 Experiments

We validate our assumptions through physical energy/time measurements on a system called Pluto. It consists of two 8-core Intel Xeon E5-2650 processors, 16 cores in total, clocked at 2.6 GHz. Each 8-core processor has its own NUMA partition, giving 2 NUMA partitions across the system. The memory hierarchy consists of 3 cache levels (private L1 and L2, shared L3), where the LLC is shared among the 8 cores of a socket. All benchmarks have been compiled with GCC 4.8.3-9 at the -O3 optimization level and executed 30 times; the following results report median values over the 30 runs.

The experiments show the performance and energy behavior of the CPPS algorithm and its building blocks (TPP, SPK) when the associative operator addition (\(+\)) is used. The underlying mechanism of the TPPs is assisted by hwloc [3], a library that exposes the mapping between physical cores and OS cores. We capture energy at socket level; the total energy consumption is given by \(t\_energy=\sum _{s=1}^{ns} E_{s}\), where \(E_{s}\) denotes the energy consumption of socket s and ns the number of occupied sockets. The energy measurements, reported in Joules, are derived via RAPL (Running Average Power Limit). For all experiments, we use arrays of \(10^9\) 32-bit integers. Experiments on floats or doubles are omitted due to space constraints; their results are quantitatively similar. We chose a very large problem size mostly because of the granularity at which RAPL estimates energy (every 1 ms). For the CPPS testbeds, in-parallel first-touch page placement is applied: all threads initialize their assigned memory addresses prior to the computational phase.
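First-touch means the OS places each page on the NUMA node of the thread that first writes it. A minimal sketch of the initialization step, where my_start and my_nelems are illustrative names for a thread's assigned range:

```c
#include <stddef.h>
#include <stdint.h>

/* Run by every thread before the computation: writing its own range first
 * makes the OS back those pages with memory local to the thread's node. */
static void first_touch(int32_t *x, size_t my_start, size_t my_nelems) {
    for (size_t i = my_start; i < my_start + my_nelems; i++)
        x[i] = 0;
}
```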

RAPL. Since the advent of Intel's Sandy Bridge architecture, Intel's processors have integrated a feature for accurately estimating energy consumption, called RAPL [7]. It provides an interface to the MSRs (model-specific registers), hardware registers that deliver information about energy and power consumption. The RAPL interface is divided into four domains: PKG (the whole CPU package), the two power planes PP0 (processor cores only) and PP1 (a specific device in the uncore), and DRAM (the memory controller). PP1 and DRAM are available on desktop and server CPUs, respectively. More detailed documentation can be found in [5].
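As an illustration of how such readings can be obtained on Linux without any library, the following sketch reads the PKG energy counter through the msr driver. The MSR addresses (0x606 for MSR_RAPL_POWER_UNIT, 0x611 for MSR_PKG_ENERGY_STATUS) are the documented Sandy Bridge values; the sketch assumes the msr kernel module is loaded and root privileges.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT   0x606
#define MSR_PKG_ENERGY_STATUS 0x611

/* Read one MSR of a given CPU via /dev/cpu/<cpu>/msr. */
static uint64_t read_msr(int fd, uint32_t reg) {
    uint64_t val = 0;
    pread(fd, &val, sizeof val, reg);
    return val;
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }
    /* Bits 12:8 of the unit register encode the energy unit as (1/2)^ESU J. */
    uint64_t esu = (read_msr(fd, MSR_RAPL_POWER_UNIT) >> 8) & 0x1f;
    double joules_per_tick = 1.0 / (double)(1ULL << esu);
    /* Bits 31:0 of the status register hold a wrapping energy counter. */
    uint64_t ticks = read_msr(fd, MSR_PKG_ENERGY_STATUS) & 0xffffffffULL;
    printf("PKG energy counter: %.3f J\n", (double)ticks * joules_per_tick);
    close(fd);
    return 0;
}
```

Two such readings taken around a run, differenced per socket and summed over the occupied sockets, give \(t\_energy\) as defined above.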

DVFS Configuration. Intel's processors (excluding Haswell) regulate all their cores under the same power state, implying that all cores must operate at the same frequency. For that reason, in every experiment we fix the same frequency for the whole system (all cores), in order to prevent the OS from choosing a different one.
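A hedged sketch of how the frequency of all cores can be fixed on Linux through the standard cpufreq sysfs interface (assuming the userspace governor is available and root privileges):

```c
#include <stdio.h>

/* Set every core to the same fixed frequency, given in kHz. */
static void set_frequency_all(int ncores, long khz) {
    char path[96];
    for (int c = 0; c < ncores; c++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", c);
        FILE *f = fopen(path, "w");
        if (f) { fputs("userspace\n", f); fclose(f); }
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", c);
        f = fopen(path, "w");
        if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
    }
}
```

For example, set_frequency_all(16, 1200000) would pin all 16 cores of a machine like Pluto to 1.2 GHz.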

Fig. 1. Performance and absolute speedup comparison between two instances of CPPS (seq2, trivial_seq) and other public approaches for a problem size of \(10^9\) 32-bit integers. The sequential kernel seq2_A:8_V:2 is used as the baseline for the speedup

CPPS Against Others. Comparing CPPS's performance against publicly available implementations is essential before reasoning about CPPS's energy and performance capabilities. Thus, in Fig. 1, we show CPPS's performance and absolute speedup against other public implementations (MCSTL [12], TBB [10], Nan Zhang [16], SWARM [1]). In this experiment, seq2_A:8_V:2 provides the baseline for the speedup values; CPPS has been configured with both this OSPK and trivial_seq, while the other implementations work with trivial_seq by default. As can be seen, CPPS outperforms all other approaches and exhibits better scalability. The improvement in performance and scalability is due to better cache reuse, better load balancing across the 3 phases of CPPS, and the selection of an OSPK.

Thread Placement. As discussed, CPPS's performance could benefit from Scatter on NUMA architectures. On Pluto, however, three factors explain why Scatter improves performance only slightly (\(<\) 4 %). First, the memory bandwidth is already saturated with 4 cores (half a socket). Second, the memory bandwidth scales poorly with the number of cores (a factor of 2.4 with all 8 cores of a socket). Third, two NUMA nodes are not enough to compensate for the memory bandwidth saturation. For instance, by spreading 12 threads over two sockets, we still operate at the maximum memory bandwidth per socket: CPPS's performance does not improve, while its energy consumption doubles.

In this experiment, CPPS is configured with the Compact and Scatter thread placements, for both trivial_seq and seq2_A:128_V:16. The results are shown in Table 1; all configurations run at the maximum frequency (2.6 GHz). The energy has been monitored via RAPL's package domain (PKG). As expected, the performance penalty is negligible under both TPPs. However, the Compact policy demonstrates high energy savings for both instances, \(-0.86\) %–46 % and 1.81 %–47.50 %, respectively. This is mostly due to the number of activated sockets: Scatter always activates two sockets, whereas Compact activates only one (for \(<\) 8 threads). In addition, the table shows that choosing an OSPK yields further energy gains of 9 %–17.56 %.

Table 1. Performance/energy comparison between 4 different CPPS configuration sets for a problem size of \(10^9\) 32-bit integers, combining members of the SPK set (seq2_A:128_V:16, trivial_seq) with those of the TPP set (Scatter, Compact)

SPK. The SPK is a dominant performance/energy factor of CPPS. Table 1 has shown that an OSPK can significantly increase CPPS's performance. Likewise, CPPS's energy consumption can be reduced even further by combining DVFS with other OSPKs. Naturally, choosing a lower frequency is not performance-loss-free; however, the information provided in this paper can be useful when maximizing performance under power caps. Thus, we investigate the energy consumption of several OSPKs at different frequencies before plugging them into CPPS. In the next step, we show CPPS's energy consumption when configured with the energy-efficient SPKs identified by these results.

Figure 2 depicts, on the left, the energy consumption of different SPKs across all frequency levels of Pluto (1.2–2.6 GHz) and, on the right, the energy gain of each SPK when shifting from the highest frequency (2.6 GHz) to lower levels. We monitor energy via RAPL's core domain (PP0). We choose PP0 over PKG and DRAM mainly because PP0's behavior is highly dependent on the CPU frequency, according to [14], and PKG/DRAM frequency control is not supported on the current architecture. Both plots show that several configurations of the other SPKs (e.g., seq1_A:4_V:4, seq3_k:2_V:8) outperform and consume less energy than trivial_seq. The right plot gives the percentage energy gain for the same SPKs: some OSPKs, such as seq3_k:2_V:8, keep scaling in energy as the frequency is lowered, whereas trivial_seq shows no energy scalability below 1.6 GHz. In general, these results motivate investigating CPPS's energy behavior in different P-states with different plugged-in OSPKs.

Fig. 2. Energy consumption comparison between trivial_seq and various OSPKs (seq2, seq3) at different frequency levels for a problem size of \(10^9\) 32-bit integers

CPPS \(+\) SPK \(+\) DVFS. The previous experiments showed that trivial_seq is not an energy-efficient SPK and introduced OSPKs that achieve better performance and significant energy savings; they also showed that the OSPKs exhibit energy scalability under DVFS. These are strong indications that CPPS can consume less energy when the best SPK-frequency configuration is chosen. The last experiment therefore shows the impact of several of these configurations on CPPS's energy, captured via RAPL's PP0 domain, and on its performance for different numbers of threads.

Figure 3 shows CPPS's energy consumption (left) and performance (right) for different setups (number of threads, SPK and frequency). For the sake of readability, only 6 distinct frequencies are reported. The results correlate strongly with the previous ones and reveal that the number of threads influences which SPK-frequency configuration should be selected. All SPKs yield lower energy consumption at lower frequencies, independent of the number of threads. Energy consumption increases when deploying more than 8 threads, due to the activation of the second socket and the memory bandwidth saturation on the first socket. This saturation can cause severe performance loss for CPPS, which eventually leads to increased energy.

Fig. 3. CPPS's energy and performance behavior when configured with different SPKs and executed at different frequencies for a problem size of \(10^9\) 32-bit integers

In addition, Fig. 3 shows large energy reductions and performance improvements when using OSPKs instead of trivial_seq, independent of the number of threads and the CPU frequency. By choosing an OSPK over trivial_seq, CPPS achieves energy gains of 24 %–55 %. For instance, with two threads, a frequency of 2.4 GHz and seq2_A:8_V:2, CPPS consumes 35 % less energy than with trivial_seq. At the same time, the lower the chosen frequency, the higher the performance penalty (right plot). However, CPPS's performance penalization rate varies across SPKs and numbers of threads. For instance, with an OSPK such as seq1_A:4_V:4, seq2_A:8_V:2 or seq3_k:1_V:8 and two threads, CPPS's performance penalty grows by 3.7 %–8.2 % while descending to lower frequency levels, but improves with more threads (16 threads: 1.3 %–5.2 %). In contrast, with trivial_seq, the rate ranges over 4 %–8.4 % and stays constant with more threads.

5 Conclusion

This paper has presented CPPS (cache-aware parallel prefix-sums) and a variety of OSPKs (optimized sequential prefix-sums kernels). We have provided performance/energy results for CPPS under different configurations (SPK (sequential prefix-sums kernel), TPP (thread placement policy), CPU frequency and number of threads), as well as for various SPKs in isolation. Overall, our experiments demonstrate that the OSPKs improve CPPS's performance by 24 % to 54 % compared to trivial_seq when running at the maximum frequency, and that choosing an OSPK over trivial_seq delivers greater energy savings (24 %–55 %) to CPPS at lower frequencies. Our results also indicate a better energy reduction rate when descending to lower frequency levels with OSPKs plugged into CPPS rather than trivial_seq. In addition, we have shown the impact of thread placement across NUMA nodes on energy consumption: energy reductions of up to 47.5 % when placing threads on fewer NUMA nodes, with almost negligible performance degradation, and of up to 56.73 % when considering both OSPK and TPP.

In general, finding the most energy-efficient CPPS configuration is a multivariable optimization problem, making an auto-tuning mechanism that generates performance/energy-optimized CPPS code a necessity. In the future, we are interested in studying CPPS's performance/energy behavior on larger shared-memory systems and on Intel's Xeon Phi coprocessor. Exploiting the latter's wide vector registers could lead to more energy savings and potentially benefit applications such as the summed-area table (SAT), which make extensive use of an SPK.