Cache and energy efficient algorithms for Nussinov’s RNA Folding

Background: An RNA folding/RNA secondary structure prediction algorithm determines the non-nested/pseudoknot-free structure by maximizing the number of complementary base pairs and minimizing the energy. Several implementations of Nussinov's classical RNA folding algorithm have been proposed. Our focus is to obtain run time and energy efficiency by reducing the number of cache misses.

Results: Three cache-efficient algorithms, ByRow, ByRowSegment and ByBox, for Nussinov's RNA folding are developed. Using a simple LRU cache model, we show that the Classical algorithm of Nussinov has the highest number of cache misses followed by the algorithms Transpose (Li et al.), ByRow, ByRowSegment, and ByBox (in this order). Extensive experiments conducted on four computational platforms (Xeon E5, AMD Athlon 64 X2, Intel I7 and PowerPC A2) using two programming languages (C and Java) show that our cache-efficient algorithms are also efficient in terms of run time and energy.

Conclusion: Our benchmarking shows that, depending on the computational platform and programming language, either ByRow or ByBox gives the best run time and energy performance. The C versions of these algorithms reduce run time by as much as 97.2% and energy consumption by as much as 88.8% relative to Classical and by as much as 56.3% and 57.8%, respectively, relative to Transpose. The Java versions reduce run time by as much as 98.3% relative to Classical and by as much as 75.2% relative to Transpose. Transpose achieves run time and energy efficiency at the expense of memory as it takes twice the memory required by Classical. The memory required by ByRow, ByRowSegment, and ByBox is the same as that of Classical. As a result, using the same amount of memory, the algorithms proposed by us can solve problems up to 40% larger than those solvable by Transpose.


Introduction
RNA secondary structure prediction (i.e., RNA folding) [1] "is the process by which a linear ribonucleic acid (RNA) molecule acquires secondary structure through intra-molecular interactions. The folded domains of RNA molecules are often the sites of specific interactions with proteins in forming RNA-protein (ribonucleoprotein) complexes." Unlike a paired double-strand DNA sequence, RNA primary structure is a single strand that can fold onto itself. The folding problem considered here is to find a pseudoknot-free set P of complementary base pairs such that the number of pairs in P is maximum. (A pseudoknot [2] "is a nucleic acid secondary structure containing at least two stem-loop structures in which half of one stem is intercalated between the two halves of another stem.") Smith and Waterman (SW) [3] and Nussinov et al. [4] proposed dynamic programming algorithms for RNA folding in 1978. Zuker et al. [5] modified Nussinov's algorithm using thermodynamic and auxiliary information. The asymptotic complexity of SW's, Nussinov's, and Zuker's algorithms is $O(n^3)$ time and $O(n^2)$ space, where n is the length of the RNA sequence. Li et al. [6] proposed a cache-aware version of Nussinov's algorithm, called Transpose, that takes twice the memory but reduces run time significantly. Many parallel algorithms for RNA folding have also been proposed (see, e.g., [6-15]).
In this paper, we focus on reducing the number of cache misses that occur in the computation of Nussinov's method without increasing the memory requirement. Our interest in cache misses stems from two observations: (1) the time required to service a lowest-level-cache (LLC) miss is typically 2 to 3 orders of magnitude more than the time for an arithmetic operation and (2) the energy required to fetch data from main memory is typically between 60 and 600 times that needed when the data is on the chip. As a result of observation (1), cache misses dominate the overall run time of applications for which the hardware/software cache prefetch modules on the target computer are ineffective in predicting future cache misses. The effectiveness of hardware/software cache prefetch mechanisms varies with the application, computer architecture, compiler, and compiler options used. So, if we are writing code that is to be used on a variety of computer platforms, it is desirable to write cache-efficient code rather than to rely exclusively on the cache prefetching of the target platform. Even when the hardware/software prefetch mechanism of the target platform is very effective in hiding memory latency, observation (2) implies excessive energy use when there are many cache misses.
We develop three algorithms that meet our objective of cache efficiency without memory increase: ByRow, ByRowSegment, and ByBox. Since these take the same amount of memory as Classical and Transpose takes twice as much, the maximum problem size (n) that can be solved in any fixed amount of memory by algorithms Classical, ByRow, ByRowSegment, and ByBox is 40% more than what can be done by Transpose. On practical but large instances, ByRow and ByRowSegment have the same run time performance. Our experiments indicate that, depending on the computational platform and programming language, either ByRow or ByBox gives the best run time and energy performance. In fact, the C versions of our proposed algorithms reduce run time by as much as 97.2% and energy consumption by as much as 88.8% relative to Classical and by as much as 56.3% and 57.8%, respectively, relative to Transpose. The Java versions reduce run time by as much as 98.3% relative to Classical and by as much as 75.2% relative to Transpose.
The rest of the paper is organized as follows. We first introduce the simple cache model that we use in our cache-efficiency analysis. Then we propose three cache- and memory-efficient RNA folding algorithms and analyze them theoretically using our cache model. Finally, we present our experimental and benchmark results.

Cache model
We use a simple cache model so that the cache miss analysis is of manageable complexity. In this model, there is a single cache whose capacity is sw words, where s is the number of cache lines and w is the number of words in a cache line. Each data item is assumed to have the same size as a word. The main memory is assumed to be partitioned into blocks of size w words each. Data transfer between the cache and memory takes place in units of a block (equivalently, a cache line). A read miss occurs whenever the program attempts to read a word that is not in cache. To service this cache miss, the block of main memory that includes the needed word is fetched and copied into a cache line, which is selected using the LRU (least recently used) rule. Until this block of main memory is evicted from this cache line, its words may be read without additional cache misses. We assume the cache is write-back with write allocate. That is, when the program needs to write a word of data, a write miss occurs if the corresponding block of main memory is not currently in cache. To service the write miss, the corresponding block of main memory is fetched and copied into a cache line. Write-back means that the word is written to the appropriate cache line only. A cache line with changed content is written back to the main memory when it is about to be overwritten by a new block from main memory.
In practice, modern computers commonly have two or three levels of cache and employ sophisticated adaptive cache replacement strategies rather than the LRU strategy described above. Further, hardware and software cache prefetch mechanisms and out-of-order execution are often deployed to hide the latency involved in servicing a cache miss. These mechanisms may, for example, attempt to learn the memory access pattern of the current application and then predict the future need for blocks of main memory. The predicted blocks are brought into cache before the program actually tries to read/write from/into those blocks, thereby avoiding (or reducing) the delay involved in servicing a cache miss. Actual performance is also influenced by the compiler used and the compiler options in effect at the time of compilation.
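The model above can be exercised directly with a small simulator. The following is a minimal sketch, assuming word addresses are plain integers and that reads and writes are treated alike (which is what write allocate implies for miss counting); the class and function names are illustrative and the write-back traffic of dirty lines is not counted.

```python
from collections import OrderedDict


class LRUCacheModel:
    """Single-level cache with s lines of w words each and LRU replacement."""

    def __init__(self, s, w):
        self.s = s                    # number of cache lines
        self.w = w                    # words per cache line (block size)
        self.lines = OrderedDict()    # resident block numbers, oldest first
        self.misses = 0

    def access(self, address):
        """Read or write one word; a miss loads the whole block (write allocate)."""
        block = address // self.w
        if block in self.lines:
            self.lines.move_to_end(block)       # now the most recently used
            return
        self.misses += 1
        if len(self.lines) == self.s:
            self.lines.popitem(last=False)      # evict the least recently used
        self.lines[block] = None


def count_misses(addresses, s, w):
    cache = LRUCacheModel(s, w)
    for a in addresses:
        cache.access(a)
    return cache.misses


n, w, s = 64, 8, 4
row_major = [i * n + j for i in range(n) for j in range(n)]   # walk along rows
col_major = [i * n + j for j in range(n) for i in range(n)]   # walk down columns
print(count_misses(row_major, s, w))   # 512  = n*n/w, one miss per block
print(count_misses(col_major, s, w))   # 4096 = n*n, one miss per access
```

The two traversals of the same n × n row-major array show the effect the analyses in this paper rely on: accesses along a row cost 1/w misses each, while accesses down a column cost one miss each once the column no longer fits in cache.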
As a result, actual performance may bear little relationship to the analytical results obtained for our simple cache model. Despite this, we believe the simple cache model serves a useful purpose in directing the quest for cache-efficient algorithms that eventually need to be validated experimentally. We believe this because our simple model favors algorithms that exhibit good spatial locality in their data access pattern over those that do not and all cache architectures favor algorithms with good spatial locality. The experimental results reported in this paper strengthen our belief in the usefulness of our simple model. These results indicate that algorithms with a smaller number of cache misses on our simple model actually have a smaller number of (lowest level) cache misses on a variety of modern computers that employ potentially different cache replacement strategies (vendors often use proprietary cache replacement strategies). Further, a reduction in cache misses on our simple model often translates into a reduction in run time.

Classical RNA folding algorithm (Nussinov's algorithm)
Let $A[1:n] = a_1 a_2 \cdots a_n$ be an RNA sequence and let $H_{i,j}$ be the maximum number of complementary pairs in a folding of the subsequence $A[i:j]$, $1 \le i \le j \le n$. So, $H_{1,n}$ is the score of the best folding for the entire sequence $A[1:n]$. The following dynamic programming equations to compute $H_{1,n}$ are due to Nussinov [4].

$$H_{i,i} = 0, \quad 1 \le i \le n$$
$$H_{i,i+1} = 0, \quad 1 \le i \le n-1$$
$$H_{i,j} = \max\left\{ H_{i+1,j},\; H_{i,j-1},\; H_{i+1,j-1} + c(a_i, a_j),\; \max_{i < k < j}\{H_{i,k} + H_{k+1,j}\} \right\}, \quad j > i + 1$$

where $c(a_i, a_j)$ is the match score between characters $a_i$ and $a_j$. If $a_i$ and $a_j$ are a complementary pair such as AU, GC or GU, $c(a_i, a_j)$ is 1; otherwise it is 0. The different cases of the recurrence in Nussinov's algorithm are illustrated in Fig. 1, where Fig. 1a shows the case when $a_i$ is added to the best RNA folding of the subsequence $A[i+1:j]$, Fig. 1b shows the case when $a_j$ is added to the best RNA folding of $A[i:j-1]$, Fig. 1c shows the case when the pair $(a_i, a_j)$ is added to the best RNA folding of $A[i+1:j-1]$, and Fig. 1d shows the case when the best foldings of two subsequences $A[i:k]$ and $A[k+1:j]$ are combined. The cases of Fig. 1a and b can be considered special cases of combining two subsequences in which one of them is a single-node subsequence. Hence, as several authors ([15], for example) have observed, Nussinov's equations may be simplified to

$$H_{i,j} = \max\left\{ H_{i+1,j-1} + c(a_i, a_j),\; \max_{i \le k < j}\{H_{i,k} + H_{k+1,j}\} \right\}, \quad j > i + 1$$

with $H_{i,i} = H_{i,i+1} = 0$ as before. Once the best RNA folding score, $H_{1,n}$, has been computed, a standard dynamic programming traceback procedure, which takes O(n) time, may be performed to find the path leading to the maximum score. This path defines the actual RNA secondary structure.
Algorithm 1 gives the Classical algorithm to compute $H_{1,n}$ using the simplified Nussinov's equations. This algorithm computes H by diagonals and within a diagonal from top to bottom. Its run time is $O(n^3)$. Although the algorithm is written using two-dimensional array notation for H, we need only the upper triangle of H. Hence, a memory-efficient implementation would either map the upper triangle into a 1D array or employ a dynamically allocated 2D array with variable size rows. In either case, we would need memory for $n(n + 1)/2$ elements of H rather than for $n^2$ elements.
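The first of these two layouts can be sketched as an index function. This is a minimal illustration, assuming 0-based indices and row-major order of the upper triangle; the name `tri_index` is hypothetical and not taken from the paper.

```python
def tri_index(i, j, n):
    """Position of H[i][j], 0 <= i <= j < n, in a row-major 1D upper triangle."""
    # Rows 0..i-1 hold n, n-1, ..., n-i+1 elements: i*n - i*(i-1)/2 in total.
    return i * n - i * (i - 1) // 2 + (j - i)


n = 5
flat = [tri_index(i, j, n) for i in range(n) for j in range(i, n)]
print(flat == list(range(n * (n + 1) // 2)))   # True: the mapping is dense
```

Because consecutive elements of a row stay adjacent in the 1D array, this layout keeps the row-wise spatial locality that the cache analysis depends on.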
For the (data) cache miss analysis, we focus on read and write misses of the array H and ignore misses due to the reads of the sequence A as well as of the scoring matrix c (notice that there are no write misses for A and c). Figure 2 shows the memory access pattern for H.

[Algorithm 1 Classical: pseudocode listing. Recoverable steps: for i ← 0 to n − 1 − d do; for k ← i to j − 1 do; temp ← max(…)]

Figure 2a shows the situation when the earlier diagonals have been computed, as have 2 elements of the fourth; we are presently computing the third element ($H_{i,j}$) of the fourth diagonal. Figure 2b shows the elements of H in row i and column j that are needed for the computation of $H_{i,j}$ (i.e., in the computation of $\max\{H_{i,k} + H_{k+1,j}\}$). The elements in row i are accessed from left to right while those in column j are accessed from top to bottom. So, w row elements are brought into cache with a single miss and a miss takes place for each element of column j that is accessed. Note that the cache lines for column j also contain the column j + 1 data needed in the computation of $H_{i+1,j+1}$. However, when n is sufficiently large, this data is overwritten by new data under the LRU policy before it can be used in the computation of $H_{i+1,j+1}$. So, for each of the $j - i$ sums in $\max\{H_{i,k} + H_{k+1,j}\}$ we incur 1/w read misses on average for $H_{i,k}$ and 1 read miss for $H_{k+1,j}$. Over the entire computation we compute $n^3/6$ (plus low order terms) of these sums, incurring a total of $(n^3/6)(1 + 1/w)$ read misses. Although to complete the computation of $H_{i,j}$ we also need $H_{i+1,j-1}$, accessing these values of H incurs only $O(n^2)$ read misses. The number of write misses for H is also $O(n^2)$. So, for our simplified cache model, the number of cache misses incurred when computing H using algorithm Classical is $(n^3/6)(1 + 1/w)$ (plus low order terms).
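The diagonal-by-diagonal computation analyzed above can be sketched as follows. This is an illustrative reading of the simplified equations, not the authors' listing: it assumes 0-based indices, a full n × n list of lists for clarity, and the pairing rule AU, GC, GU; the names `c` and `classical` are chosen here.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def c(x, y):
    """Match score: 1 for a complementary pair, 0 otherwise."""
    return 1 if (x, y) in PAIRS else 0


def classical(seq):
    n = len(seq)
    H = [[0] * n for _ in range(n)]      # only entries with j >= i are used
    for d in range(2, n):                # diagonals j - i = d; d = 0, 1 stay 0
        for i in range(n - d):           # top to bottom within a diagonal
            j = i + d
            best = H[i + 1][j - 1] + c(seq[i], seq[j])
            for k in range(i, j):
                # H[i][k] walks along row i, H[k+1][j] walks down column j
                best = max(best, H[i][k] + H[k + 1][j])
            H[i][j] = best
    return H


print(classical("GGACC")[0][4])   # 2: pairs (G1, C5) and (G2, C4)
```

The inner loop makes the source of the $(n^3/6)(1 + 1/w)$ estimate visible: the row operand is read with unit stride while the column operand jumps a full row between consecutive values of k.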

Transpose RNA folding algorithm
Li et al. [6] have proposed a cache-efficient computation of Nussinov's simplified equations. Their algorithm, which we refer to as Transpose, uses an n × n array H in which the upper triangle is used to store the $H_{i,j}$, $j \ge i$, values defined by Nussinov's equations and the lower triangle is used to store the transpose of the upper triangle. That is, $H_{i,j} = H_{j,i}$ for all i and j. As new $H_{i,j}$s are computed, they are stored in both $H_{i,j}$ and $H_{j,i}$. The sum $H_{i,k} + H_{k+1,j}$ is computed as $H_{i,k} + H_{j,k+1}$, so both operands are read along rows (rows i and j), with the result that a sum now requires only 2/w cache misses on average. So, the total number of read misses is $(n^3/6)(2/w)$ plus low order terms. The number of write misses is $O(n^2)$. The ratio of cache misses of Classical to Transpose is approximately $(1 + 1/w)/(2/w) = (w + 1)/2$.
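The idea of reading the column operand from the mirrored lower triangle can be sketched as below. This is a minimal illustration of the technique as described in this section, not the code of Li et al.; it assumes 0-based indices and the AU, GC, GU pairing rule, and the function names are chosen here.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def c(x, y):
    return 1 if (x, y) in PAIRS else 0


def transpose_fold(seq):
    n = len(seq)
    H = [[0] * n for _ in range(n)]      # full square: upper triangle + mirror
    for d in range(2, n):
        for i in range(n - d):
            j = i + d
            best = H[i + 1][j - 1] + c(seq[i], seq[j])
            row_i, row_j = H[i], H[j]
            for k in range(i, j):
                # row_j[k + 1] is the mirrored copy of H[k + 1][j],
                # so both operands are read with unit stride
                best = max(best, row_i[k] + row_j[k + 1])
            H[i][j] = best
            H[j][i] = best               # keep the lower triangle in sync
    return H


H = transpose_fold("GGACC")
print(H[0][4], H[4][0])   # 2 2
```

The second store per element is what doubles the memory footprint relative to a triangular layout.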

ByRow RNA folding algorithm
Although Transpose reduces the number of cache misses (in our model) by an impressive factor of (w + 1)/2 relative to Classical, it does so at the cost of doubling the memory requirement. The increased memory requirement means that Classical can be used to solve problems up to 40% bigger than can be solved by Transpose on any computer with a fixed memory size. For smaller instances that can be solved by both algorithms, we expect Transpose to take less time. In this section, we propose an alternative cache-efficient algorithm, ByRow, that does not have a memory penalty associated with it. In our cache model, ByRow incurs the same number of cache misses as incurred by Transpose.
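The 40% figure follows from the two memory footprints. As a short worked derivation, assume a memory budget of M array elements and write $n_T$ and $n_C$ (symbols introduced here) for the largest sequence lengths that Transpose and Classical can handle within that budget:

```latex
n_T^2 = M,
\qquad
\frac{n_C (n_C + 1)}{2} \approx \frac{n_C^2}{2} = M
\quad\Longrightarrow\quad
\frac{n_C}{n_T} \approx \sqrt{2} \approx 1.41
```

So any algorithm that stores only the upper triangle can handle sequences roughly 41% longer than Transpose in the same memory, which the text rounds to 40%.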
The algorithm ByRow computes the $H_{i,j}$s by row bottom-to-top and within a row left-to-right. This is illustrated in Fig. 3. Figure 3a shows the situation after the 4 bottommost rows of H have been computed. The computation of the next row (i.e., row 5 from the bottom in our example) is done in two stages. Note that the first two elements on each row are 0 by definition. So, only elements 3 onward are to be computed. In the first stage, every $H_{i,j}$, $j > i + 1$, on the row being computed is initialized to $H_{i+1,j-1} + c(a_i, a_j)$. The memory access pattern for this is shown in Fig. 3b. The second stage comprises many sub-stages. In a sub-stage, all $H_{i,j}$s in row i are updated using the sums $H_{i,k} + H_{k+1,j}$ for a single k. In the first sub-stage, we use $H_{i,i}$ and $H_{i+1,j}$ to update $H_{i,j}$, $j > i + 1$ (see Fig. 3c). In the next sub-stage, we use $H_{i,i+1}$ and $H_{i+2,j}$ to update $H_{i,j}$, $j > i + 1$, and so on. Algorithm 2 gives the details.

[Algorithm 2 ByRow: pseudocode listing. Recoverable steps: for k ← i to n − 2 do; for j ← k + 1 to n − 1 do]

It is easy to see that ByRow takes $O(n^3)$ time and that its memory requirement is the same as that of Classical and about half that of Transpose. For the cache miss analysis, we see that for each element initialized in stage 1, an average of 1/w read misses and 1/w write misses occur. So, this stage contributes $O(n^2)$ to the overall cache miss count. For the second stage, we see that the total number of read misses for the first term in $H_{i,k} + H_{k+1,j}$ over all sub-stages is $O(n^2/w)$ and that for the second term is $(n^3/6)(1/w)$ (plus low order terms). Additionally, there are $(n^3/6)(1/w)$ (plus low order terms) read misses for $H_{i,j}$. So, the total number of misses is $(n^3/6)(2/w)$ (plus low order terms).
The algorithm ByRowSegment reduces this count by computing the elements in each row of H in segments of size no larger than the capacity of our cache. The segments in a row are computed from left to right. When the segment size is s, the number of read misses for $H_{i,k}$ becomes $(n^3/6)(1/s)$. The misses for $H_{k+1,j}$ remain $(n^3/6)(1/w)$. So, the total number of misses is further reduced to $(n^3/6)(1/s + 1/w)$.
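The two-stage row computation described above can be sketched as follows. This is an illustrative reading of the description rather than the authors' Algorithm 2; it assumes 0-based indices, a full n × n list of lists for clarity, and the AU, GC, GU pairing rule, with stage 1 taken to include the match score required by the simplified equations.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def c(x, y):
    return 1 if (x, y) in PAIRS else 0


def by_row(seq):
    n = len(seq)
    H = [[0] * n for _ in range(n)]
    for i in range(n - 3, -1, -1):            # rows bottom to top
        row_i = H[i]
        for j in range(i + 2, n):             # stage 1: initialize row i
            row_i[j] = H[i + 1][j - 1] + c(seq[i], seq[j])
        for k in range(i, n - 1):             # stage 2: one sub-stage per k
            h_ik = row_i[k]                   # already final when k is reached
            row_k1 = H[k + 1]                 # a row computed earlier
            for j in range(k + 1, n):         # unit stride in both rows
                if h_ik + row_k1[j] > row_i[j]:
                    row_i[j] = h_ik + row_k1[j]
    return H


print(by_row("GGACC")[0][4])   # 2
```

Every inner loop here scans two rows from left to right, which is why no transposed copy is needed to obtain unit-stride access.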
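The text gives only an outline of ByRowSegment, so the following is one possible reading, offered as a sketch: each row is cut into segments of at most `seg` columns (a parameter name chosen here), and both stages of ByRow are completed for one segment before moving to the next. Indices are 0-based and the pairing rule is AU, GC, GU.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def c(x, y):
    return 1 if (x, y) in PAIRS else 0


def by_row_segment(seq, seg):
    n = len(seq)
    H = [[0] * n for _ in range(n)]
    for i in range(n - 3, -1, -1):                 # rows bottom to top
        row_i = H[i]
        for lo in range(i + 2, n, seg):            # segments left to right
            hi = min(lo + seg, n)
            for j in range(lo, hi):                # stage 1 on this segment
                row_i[j] = H[i + 1][j - 1] + c(seq[i], seq[j])
            for k in range(i, hi - 1):             # stage 2 on this segment
                h_ik = row_i[k]                    # final: k lies left of j
                row_k1 = H[k + 1]
                for j in range(max(lo, k + 1), hi):
                    if h_ik + row_k1[j] > row_i[j]:
                        row_i[j] = h_ik + row_k1[j]
    return H


print(by_row_segment("GGACC", 2)[0][4])   # 2
```

With `seg` at least as large as the row length, the loop over segments runs once per row and the function does the same work as ByRow.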

ByBox RNA folding algorithm
In the ByBox algorithm, we partition H into boxes and compute these boxes in an appropriate order. For the partitioning, we first divide the rows of H into strips of p rows each from bottom-to-top (Fig. 4a). Note that the topmost strip may have fewer than p rows. Next, each strip is partitioned into a triangular box and multiple rectangular boxes (Fig. 4b). The width of the first box is p, that of all but the last of the remaining boxes is q, and that of the last is ≤ q.
Observe that the first box in a strip is a p × p triangle (the height of the triangle in the topmost strip may be less than p), the last box in a strip is a p × q rectangle (again the height in the top strip may be less than p), and the remaining boxes are p × q boxes (again, the height may be less in the top strip).
The elements in triangular boxes are computed using ByRow. These triangular boxes may be computed in any order. The rectangular boxes are computed by strips bottom-to-top and within a strip from left-to-right. Let T denote the rectangular box to be computed next (Fig. 5a). Because of the order in which rectangular boxes are computed, all H values to its left and below it have already been computed. Let $L_0, L_1, \ldots, L_{k-1}$ be the boxes to the left of T. Note that $L_0$ is a triangular box. Partition the Hs below T into q × q boxes $B_1, B_2, \ldots, B_{k-1}$ plus a last triangular box $B_k$ whose width is w (Fig. 5b).
To compute T, we first consider the pairs of rectangular boxes $(L_i, B_i)$.

The time and memory required by algorithm ByBox are the same as for Classical and ByRow. For the cache miss analysis, assume that we have enough cache to hold one pair $(L_i, B_i)$ as well as the box T. Loading $L_i$ and $B_i$ into cache incurs $pq/w$ misses for $L_i$ and $q^2/w$ for $B_i$. The number of $H_{i,k} + H_{k+1,j}$ computations we can do for each H in T without additional misses is q. So, with $(p+q)q/w$ cache misses we can do $pq^2$ sum computations, or an average of $(p+q)q/(wpq^2) = (p+q)/(wpq)$ misses per computation. Therefore, to do all $n^3/6$ required computations we incur $(n^3/6)(p+q)/(wpq)$ cache misses. The misses attributable to the remaining terms in Nussinov's equations as well as to writes of H are $O(n^2)$ and may be ignored.
When q = w, the cache miss count for ByBox becomes $(n^3/6)(1/w^2 + 1/(wp))$, which is quite a bit less than that for our other algorithms.
When p = 1, ByBox has much similarity with ByRowSegment. However, because ByBox needs sufficient cache for a q × q box $B_i$, we must have $q \le \sqrt{s}$, where s is the largest segment size that can be accommodated in cache. So, with $p = 1$ and $q = \sqrt{s}$, the miss count for ByBox is $(n^3/6)(p + q)/(wpq) = (n^3/6)(1 + 1/\sqrt{s})(1/w)$, which is more than that for ByRowSegment when $w < \sqrt{s}$.
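Since only part of the box computation is spelled out here, the following sketch shows one way a strip-and-box schedule can be made to work; it should be read as an assumption-laden illustration, not as the authors' ByBox. For each rectangular box it first folds in all sums whose two operands lie in finished boxes to the left and below (in chunks of at most q values of k, standing in for the $(L_i, B_i)$ pairs), and then resolves the dependencies inside the box. Indices are 0-based and the pairing rule is AU, GC, GU.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def c(x, y):
    return 1 if (x, y) in PAIRS else 0


def by_box(seq, p, q):
    n = len(seq)
    H = [[0] * n for _ in range(n)]
    r1 = n
    while r1 > 0:                                   # strips, bottom to top
        r0 = max(0, r1 - p)                         # strip rows are r0..r1-1
        # Triangular box: ByRow restricted to columns r0..r1-1.
        for i in range(r1 - 1, r0 - 1, -1):
            for j in range(i + 2, r1):
                H[i][j] = H[i + 1][j - 1] + c(seq[i], seq[j])
            for k in range(i, r1 - 1):
                h_ik = H[i][k]
                for j in range(k + 1, r1):
                    H[i][j] = max(H[i][j], h_ik + H[k + 1][j])
        # Rectangular boxes of the strip, left to right.
        for c0 in range(r1, n, q):
            c1 = min(c0 + q, n)                     # box columns are c0..c1-1
            # Phase 1: operands in finished boxes (k from r1-1 to c0-1).
            for k0 in range(r1 - 1, c0, q):
                for i in range(r0, r1):
                    for k in range(k0, min(k0 + q, c0)):
                        h_ik = H[i][k]
                        row_k1 = H[k + 1]
                        for j in range(c0, c1):
                            H[i][j] = max(H[i][j], h_ik + row_k1[j])
            # Phase 2: terms that depend on entries inside the box itself.
            for i in range(r1 - 1, r0 - 1, -1):
                for j in range(c0, c1):
                    best = H[i][j]
                    if j > i + 1:                   # H[i][i+1] stays 0
                        best = max(best, H[i + 1][j - 1] + c(seq[i], seq[j]))
                    for k in range(i, r1 - 1):      # H[k+1][j] lies lower in the box
                        best = max(best, H[i][k] + H[k + 1][j])
                    for k in range(c0, j):          # H[i][k] lies left in the box
                        best = max(best, H[i][k] + H[k + 1][j])
                    H[i][j] = best
        r1 = r0
    return H


print(by_box("GGACC", 2, 2)[0][4])   # 2
```

Phase 1 has the shape of a blocked (max, +) matrix product, which is where the reuse counted in the cache analysis comes from: each loaded operand contributes to many entries of the box before it is evicted.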
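The leading terms derived in the preceding sections can be compared numerically. The helper below is a small sketch that simply evaluates those formulas; the function name and the sample parameter values are illustrative, and `s` denotes the segment size of ByRowSegment as in the text.

```python
def leading_miss_terms(n, w, s, p, q):
    """Leading-order cache miss counts in the simple LRU model."""
    base = n ** 3 / 6
    return {
        "Classical": base * (1 + 1 / w),
        "Transpose": base * (2 / w),
        "ByRow": base * (2 / w),
        "ByRowSegment": base * (1 / s + 1 / w),
        "ByBox": base * (p + q) / (w * p * q),
    }


counts = leading_miss_terms(n=10_000, w=8, s=4096, p=64, q=64)
for name, misses in counts.items():
    print(f"{name:13s} {misses:.3e}")
```

For these sample values the order is Classical, then Transpose and ByRow (tied), then ByRowSegment, then ByBox, and the Classical to Transpose ratio equals (w + 1)/2 = 4.5.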

Practical considerations
We make the following observations regarding our expectations for the performance of the various algorithms for Nussinov's equations described in this section:
1. We have used a very simple 1-level cache model for our analyses and also assumed an LRU replacement strategy. Modern computers have two or three levels of cache and employ more sophisticated cache replacement strategies. So, our analyses are at best a crude approximation of actual cache misses.
2. Modern computers employ sophisticated hardware and software methods for cache miss prediction and prefetch data based on this prediction. To the extent these methods are successful in accurately predicting the need for data sufficiently in advance, the latency due to cache misses can be masked. As a result, observed run times may not be indicative of cache misses.
3. In practice, the maximum n will be small enough that many of the cache misses counted in our analyses will actually not occur. For example, in the ByRow algorithm, the lowest level cache will usually be large enough to hold a row of H. When n = 100,000 (say), we will need more than $2 \times 10^{10}$ bytes of main memory to hold the upper triangle of H (assuming 4 bytes per element) and only 400,000 bytes of cache to hold a row of H. As a result, the cache misses for $H_{i,j}$ will be $O(n^2)$ rather than $O(n^3)$. Similarly, for ByRowSegment, s = n. So, in practice, we expect ByRow and ByRowSegment to have the same performance.
4. In ByBox, using a q as small as w is not expected to result in speedup because of the overheads involved in this algorithm. In practice, we wish to use large, nearly square boxes such that $L_i$, $B_i$, and T fit in cache. When the size of the lowest level cache is sufficient for $3 \times 2^{20}$ elements (say), we could set p = q = 1024.
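The sizes quoted in observations 3 and 4 can be checked with a few lines of arithmetic. The sketch below takes n = 100,000 as an assumption (it is the value that reproduces the quoted byte counts) together with the stated 4 bytes per element; the variable names are chosen here.

```python
n = 100_000
bytes_per_element = 4

triangle_bytes = n * (n + 1) // 2 * bytes_per_element   # upper triangle of H
row_bytes = n * bytes_per_element                        # one full row of H
print(triangle_bytes)   # 20000200000, a little over 2 * 10**10
print(row_bytes)        # 400000

p = q = 1024
box_elements = p * q                                     # one p x q box
print(3 * box_elements == 3 * 2 ** 20)                   # True: L_i, B_i and T together
```

A row is therefore about five orders of magnitude smaller than the whole table, which is why it can be expected to stay resident in a typical lowest level cache.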

Experimental platform and test data
We implemented the Classical, Transpose, ByRow, and ByBox RNA folding algorithms in two programming languages: C and Java. For the data set sizes used by us, ByRow and ByRowSegment are identical as a row fits into cache and the segment size equals the row size. Consequently, we did not experiment with ByRowSegment. For all but Transpose, we conducted preliminary tests benchmarking 3 different implementations as below:
1. H is a classical n × n array.
2. The upper triangle of H is mapped into a 1D array of size $n(n + 1)/2$ in row-major order [16].
3. H is a 2D array with variable size rows. The first row has n entries, the next has n − 1, the next has n − 2, and so on, and the last has 1 entry. Such an array may be dynamically allocated as in [16].
The last two of these implementations take about half the memory taken by Transpose and by the first implementation. Our preliminary benchmarking showed that, in C, the last implementation is faster than the other two, while in Java the first implementation is the fastest and the third the next fastest. More specifically, the third implementation takes between 1% and 4% less time than the [...]

Our Xeon platform had tools to measure cache misses and energy consumption. So, for this platform we report cache misses and energy consumption as well as run time. On this platform, we used the "perf" [17] software to measure energy usage through the RAPL interface. For the PowerPC A2 (Blue Gene Q) platform, the MonEQ software [18,19] was used to measure the power usage every [...]

C Implementations
Xeon E5-2603
Figure 6 and Table 1 give the run times of our various algorithms for our random data sets on our Xeon platform for sequence sizes between 4000 and 40,000. Figure 7 and Table 2 do this for sample real RNA sequences from [20]. In both figures, the time is in seconds, while in both tables the time is given using the format hh:mm:ss. We did not measure the time required by Classical for n > 28,000 as this algorithm took almost 6 hours for n = 28,000. The column labeled RvsC (BvsC) in Tables 1 and 2 gives the run time reduction achieved by ByRow (ByBox) relative to Classical. Similarly, RvsT and BvsT give the reductions relative to Transpose. As can be seen, on our Xeon platform, ByRow performs better than the Classical and Transpose algorithms, and ByBox outperforms the other three algorithms. On the randomly generated data set, the ByRow algorithm reduces run time by up to 89.13% compared to the original Nussinov's Classical algorithm and by up to 35.18% compared to the cache-efficient Transpose algorithm of Li et al. [6]. The corresponding reductions for ByBox are up to 91.26% and 56.31%. On the real RNA sequences, the ByRow algorithm reduces run time by up to 90.38% and 35.19% compared to the Classical and Transpose algorithms, respectively. The corresponding reductions for ByBox are up to 91.93% and 56.58%.
Since the results for randomly generated RNA sequences are comparable to those for similarly sized sequences from the NCBI database [20], in the rest of the paper we present results only for randomly generated sequences.

Figure 8 and Table 3 give the number of cache misses on our Xeon platform. ByBox yields a very significant reduction in cache misses, which is expected given the cache miss analysis done using our simple cache model. The reduction in run time, while significant, isn't as much as the reduction in cache misses, possibly due to the effect of cache prefetching, which reduces cache-induced computational delays. Figure 9 and Table 4 give the CPU and cache energy consumption, in joules, on our Xeon platform. On our datasets, ByBox required up to 88.77% less CPU and cache energy than Classical and up to 57.76% less than Transpose. It is interesting to note that the energy reduction is comparable to the reduction in run time, suggesting a close relationship between run time and energy consumption for this application.

Figure 10 and Table 5 give the run times on our AMD platform. The Classical algorithm took over 9 hours for n = 16,000. As a result, we did not measure the run time of this algorithm for larger values of n. ByBox is faster than ByRow and both are substantially faster than Classical and Transpose. ByBox reduced run time by up to 97.16% compared to Classical and by up to 39.55% compared to Transpose. The reductions achieved by ByRow relative to Classical and Transpose were up to 96.08% and up to 18.33%, respectively.

Figure 11 and Table 6 give the run times on our Intel I7 platform. Once again, we were unable to run Classical on our larger data sets (this time, n > 28,000) because of the excessive time required by this algorithm on these larger data sets. As was the case for our Xeon and AMD platforms, the algorithms are ranked ByBox, ByRow, Transpose, Classical, fastest to slowest.
The run time reduction achieved by ByBox is up to 93.70% relative to Classical and up to 51.92% relative to Transpose. ByRow is up to 89.19% faster than Classical and up to 15.62% faster than Transpose.

Fig. 11 Run time, in seconds, for random sequences on Intel I7 platform

Table 7 gives the run times on our PowerPC A2 platform. On this platform, we were able to run Classical only for n ≤ 8000 and the remaining algorithms only for n ≤ 15,000, because of the excessive time required by our algorithms on larger instances. The speed ranking of our algorithms on this platform is consistent with that on our other 3 platforms: fastest to slowest, it is ByBox, ByRow, Transpose, Classical. ByBox is up to 87.74% faster than Classical and up to 33.43% faster than Transpose, while ByRow is up to 84.18% faster than Classical and up to 14.68% faster than Transpose. Table 8 gives the energy consumption in joules on our PowerPC platform. As on the other platforms, the energy reduction achieved by our cache-efficient algorithms tracked run time quite closely. For example, while ByBox was almost always slower than Transpose, it almost always used less energy. ByBox reduced energy consumption by up to 87.59% relative to Classical and by up to 40.31% relative to Transpose. ByRow reduced energy consumption by up to 82.6% and 16.7% relative to Classical and Transpose, respectively.

Java implementations
The Java implementations take substantially more time and memory than do the C implementations. Because of memory limitations, Transpose could not be run on our AMD and Intel platforms for n ≥ 16,000 and n ≥ 24,000, respectively. Because of time requirements, we did not experiment with n > 28,000 for any algorithm on any platform. The speed ranking, fastest to slowest, for the Java implementations is ByBox, Transpose, ByRow, Classical. The Java implementation of ByBox was up to 88.9% faster than the Java implementation of Classical on our Xeon platform, up to 98.3% faster on the AMD, and up to 88.5% faster on the Intel I7. The corresponding speedups relative to the Java implementation of Transpose were 75.2%, 64.6%, and 69.7%.
We observe that the run time of ByRow was generally more than that of Transpose on all of our platforms. We suspect this is because our Java code for ByRow makes more accesses to array elements than our Java code for Transpose does. Array accesses are expensive in Java as the array indexes are checked for validity whenever an attempt is made to access an array element (we note that C does not perform such a check). Although some Java compilers eliminate this check when they can assert there will be no violation of array bounds, their ability to make this assertion is both variable and limited. In the case of Transpose, our code reduces the number of array accesses significantly by copying an array element that is to be used many times into a simple variable and then referring to this simple variable in reuses of the element. This reduction strategy could not be employed in the code for ByRow. As a result of the increased array bounds checking done in our Java code for ByRow relative to that done in our Java code for Transpose, the former is often slower.

Discussion and conclusions
We have proposed three cache-efficient algorithms (ByRow, ByRowSegment, and ByBox) for RNA folding using Nussinov's dynamic programming equations. Their cache miss efficiency was analyzed using a simple cache model. Although the simple cache model does not accurately reflect the cache architecture of modern computers, it is useful for an initial assessment of cache performance, as the model encourages the design of algorithms with good spatial locality, and good spatial locality results in better cache performance on virtually all cache architectures. Our algorithms were benchmarked against the classical implementation, Classical, of Nussinov's equations as well as the cache-efficient implementation Transpose proposed by Li et al. [6]. The benchmarking was done using four different computational platforms (Xeon E5, AMD Athlon 64 X2, Intel I7, PowerPC A2) and two programming languages (C and Java). For the benchmarking, we excluded ByRowSegment because, for the dataset sizes we could handle on our test platforms, ByRow and ByRowSegment are identical. Our benchmarking shows that, depending on the computational platform and programming language, either ByRow or ByBox gives the best run time and energy performance. In fact, the C versions of these algorithms reduce run time by as much as 97.2% and energy consumption by as much as 88.8% relative to Classical and by as much as 56.3% and 57.8%, respectively, relative to Transpose. The Java versions reduce run time by as much as 98.3% relative to Classical and by as much as 75.2% relative to Transpose. The algorithms ByRow, ByRowSegment, ByBox, and Classical require about half as much memory as does Transpose. While run time becomes a limiting factor more often than memory, in our Java experiments we were unable to run Transpose on our larger data sets on our AMD and Intel I7 platforms because of insufficient memory.