Crosstalk Aware Register Reallocation Method for Green Compilation

As nanoscale processes become mainstream in IC manufacturing, crosstalk has become a serious challenge, not only for energy efficiency and performance but also for security. In this paper, we propose a register reallocation algorithm called Nearby Access based Register Reallocation (NARR) to reduce the crosstalk on instruction buses. The method consists of constructing a Nearby Access Aware Interference Graph (NAIG) for the software, using data flow analysis at the assembly level, and then reallocating the registers of the software. Experimental results show that crosstalk can be reduced dramatically, especially 4C crosstalk, with a reduction of 80.84% on average and up to 99.99%.


Introduction
As technology advances, embedded devices become smaller and smaller and their bus lines are laid out more and more densely, which makes crosstalk an increasingly serious challenge for circuit design. The growth of crosstalk not only limits the scalability and performance of the embedded system, but also consumes more power, making the device more vulnerable to overheating and to malicious attackers. In particular, attackers can exploit the additional power consumption of crosstalk through Differential Power Analysis to obtain secret and hidden information about the system, for example by revealing hidden hardware faults on integrated circuits, accessing cryptographic keys, and recovering the code actually executed by the microprocessor [Mangard, Oswald and Standaert (2011); Zhang, Fang, Li et al. (2016); Park, Xu, Jin et al. (2018); Liu, Yarom, Ge et al. (2015)]. Furthermore, the extra power increases noise and shortens the lifetime of the embedded device, thereby compromising the current pursuit of "green compilation".
Crosstalk is a traditional problem in circuit design and much effort has been devoted to it. Circuit designers have proposed various methods to reduce the crosstalk between coupled buses, such as codecs [Duan, Calle and Khatri (2009); Shirmohammadi, Mozafari and Miremadi (2017)], buffer insertion [Halak and Yakovlev (2010)], shielding [Mutyam (2009)], gate sizing [Gupta and Ranganathan (2011)], and so on. Lucas et al. [Lucas and Moraes (2009)] evaluated different crosstalk fault tolerant approaches for Networks-on-chip (NoC) links such that the network can maintain its original performance even in the presence of errors. Their results demonstrated that CRC coding at each link should be preferred if minimal area and power overhead are the main goals. Cui et al. [Cui, Ni, Miao et al. (2017)] proposed an enhanced code based on the Fibonacci number system (FNS) to suppress the crosstalk noise below the 6C level, in which both the redundancy of numbers and the non-uniqueness of Fibonacci-based binary codewords were utilized to search for a proper codeword. Experimental results showed that the proposed technique decreased the latency of TSVs by about 22% compared with the worst crosstalk cases. Shirmohammadi et al. [Shirmohammadi and Sabzi (2018)] proposed the DR coding mechanism, which uses a novel numerical system to generate code words, minimizes the overhead of the codec, and is applicable to any arbitrary width of wires. Experimental results showed that the worst crosstalk-induced transition patterns are completely avoided in wires using the DR coding mechanism. Jiao et al. [Jiao, Wang and He (2018)] proposed a crosstalk-noise-aware bus coding scheme with ground-gated repeaters. This approach minimized the routing overhead as well as the power consumption of data bus systems. The routing overhead was reduced by 12.31% with the new bus coding scheme compared to the conventional data bus with shielding wires. Furthermore, the leakage power and worst-case active power consumption were reduced by 12.5% and 18.26%, respectively, with the new crosstalk-noise-aware data bus system compared to the previously published bus coding system in an industrial 40nm CMOS technology. Ohama et al. [Ohama, Yotsuyanagi, Hashizume et al. (2017)] proposed a method for selecting the adjacent lines to which signal transitions are assigned in test pattern generation. The selection method could reduce the number of adjacent lines used in test pattern generation without degrading the quality of the test patterns that excite the fault effect. Bamberg et al. [Bamberg, Najafi and Garciaortiz (2019)] presented a 3D CAC method based on an intelligent fixed mapping of the bits of existing 2D CACs onto rectangular or hexagonal TSV arrangements. Their method required less hardware and reduced the maximum crosstalk of modern TSV and metal wire buses by 37.8% and 47.6%, respectively, while leaving their power consumption almost unaffected. However, these methods either need extra hardware support or increase the chip area, making them unfavorable for advanced embedded devices that require portability and minimal cost.
Building on the existing Selective Shielding method, Weng et al. [Weng, Lin, and Shann (2010)] proposed a combined hardware/software register relabeling approach to reduce the crosstalk of the instruction bus. Kuo et al. [Kuo, Chiang and Hwang (2007)] also adopted a combined approach with instruction rescheduling, register renaming, NOP instruction padding, and instruction opcode assignment. They proposed the software method to eliminate the 4C crosstalk. These methods either rely on existing hardware support or have limitations, illustrated in the next section, in reducing the crosstalk.
Register allocation is also an important component of compilers, and many techniques have been proposed, such as graph-based register allocation [Florea and Geliert (2016); Odaira, Nakaike, Inagaki et al. (2010)], linear scan register allocation [Poletto and Sarkar (1999); Wimmer and Franz (2010)], tree-based register allocation, and others [Lozano, Carlsson, Blindell et al. (2019); Su, Wu and Xue (2017); Chen, Lueh and Ashar (2018)]. Tabani et al. [Tabani, Arnau, Tubella et al. (2018)] proposed a new register renaming technique that leverages physical register sharing by introducing minor changes in the register map table and the issue queue. Experimental results showed that it provides a 6% speedup on average for the SPEC2006 benchmarks on a modern out-of-order processor. Kananizadeh et al. [Kananizadeh and Kononenko (2018)] proposed a new class of register allocation and code generation algorithms that can be performed in linear time. These algorithms are based on the mathematical foundations of abstract interpretation and the computation of the level of abstraction. They have been implemented in a specialized library for just-in-time compilation. The specialization of this library involves the execution of common intermediate language (CIL) and low level virtual machine (LLVM) code with a focus on embedded systems. However, most of these methods aim at increasing performance with little spill code, while the crosstalk between instructions is seldom considered. We propose here a software method, Nearby Access based Register Reallocation (NARR), to reduce the crosstalk. Though similar to the graph coloring register allocation method, it is distinguished by using the frequency of nearby accesses when assigning the registers. Our register reallocation approach is not only a software-only method requiring no hardware modifications, but also reduces the instruction bus crosstalk more effectively because it analyzes the data flow in depth.
The contributions of this paper to software-based crosstalk reduction can be summarized as follows:
- A new register reallocation algorithm called NARR is proposed to reduce the crosstalk. It is a pure software method without any hardware modifications, and by reducing crosstalk that register renaming and other software techniques fail to remove, it can in theory yield extra power savings and improved security.
- A modified interference graph called Nearby Access Aware Interference Graph (NAIG) is designed and implemented with the help of assembly-level data flow analysis and profiling information, which makes the register reallocation feasible and easier.
- The algorithm is implemented and its crosstalk improvement is evaluated. Results show that NARR is an efficient algorithm for reducing crosstalk, especially the 4C class of crosstalk.
Section 2 first illustrates the background and motivation, and then introduces our new crosstalk aware register allocation algorithm (NARR). Section 3 presents the performance evaluation using benchmarks from MiBench. Section 4 draws some conclusions and highlights future directions.

Methods and materials

Crosstalk overview
Crosstalk is the noise induced in one circuit or channel of a transmission system by other circuits or channels, usually those running parallel to the affected one. The strength of crosstalk depends on factors such as wire length, wire width, and the switching pattern of nearby wires. To better evaluate the delay and energy caused by crosstalk, researchers have established the crosstalk delay model and energy model as Eqs. (1) and (2), respectively [Moll, Roca and Isern (2003); Duan, Calle and Khatri (2009); Mutyam (2009)], where k is a constant determined by the driver strength and wire resistance. The above formulas show that different transmission patterns can influence the effect of crosstalk significantly because of their different effective capacitances. According to the effective capacitance of the different switching patterns of nearby wires in consecutive cycles, crosstalk is classified into six classes, shown in Tab. 1. (The symbols -, ↑, ↓ and x stand for no transition, a positive transition, a negative transition, and any transition, respectively.) Current research focuses on how to eliminate the 3C and 4C classes of crosstalk and to reduce the other classes. However, the established techniques are more or less based on modifying the integrated circuit, which can increase the overhead of the system, and therefore cannot be used in some cost-constrained embedded systems. Some software approaches attempt to reduce the crosstalk on instruction data buses. Some of them need to insert extra "NOP" instructions. Others cannot exploit the full potential of changing registers because they analyze too little program information, such as data flow. In the next section, we illustrate the limitations of register renaming techniques, as well as how to overcome them by using register reallocation, with a simple example.
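As an illustration of how a switching pattern maps to a crosstalk class, the following minimal sketch classifies one victim wire from the transitions on it and on its two neighbors. It assumes the commonly used three-wire coupling model, in which the class index equals |δv − δl| + |δv − δr| for transition values δ ∈ {−1, 0, +1}; since Tab. 1 and Eqs. (1) and (2) are not reproduced here, this mapping and the function name crosstalk_class are assumptions for illustration only.

```python
def crosstalk_class(left, victim, right):
    """Classify the crosstalk seen by the victim wire.

    Each argument is a (bit_before, bit_after) pair. Returns the class
    index 0..4 (0C..4C) under the assumed three-wire model, or None when
    the victim wire itself does not switch.
    """
    def delta(transition):
        before, after = transition
        return after - before  # +1 rising, -1 falling, 0 quiet

    dv = delta(victim)
    if dv == 0:
        return None  # no transition on the victim, so no delay class
    return abs(dv - delta(left)) + abs(dv - delta(right))


# Victim rises while both neighbours fall: the worst (4C) pattern.
worst = crosstalk_class((1, 0), (0, 1), (1, 0))
# All three wires rise together: the best (0C) pattern.
best = crosstalk_class((0, 1), (0, 1), (0, 1))
```

Under this model a 4C event arises only when both neighbors switch in the direction opposite to the victim, which is why the choice of adjacent bit patterns in consecutive instruction words matters.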

Crosstalk case study
Register renaming is used as a software-only or combined software/hardware technique for reducing crosstalk. However, because it does not analyze the lifetime of registers and does not use the results of register allocation, register renaming alone cannot overcome the limitation of register allocation, which aims to use a minimal number of registers to achieve good program performance. It therefore loses potential improvement in crosstalk reduction.
Consider the example instruction list from Kuo et al. [Kuo, Chiang and Hwang (2007)], illustrated in Fig. 1(a). Instruction scheduling fails to reduce the crosstalk on the instruction bus across I1 → I2 → I3. From the I2 and I3 lines, we can see that I2 contains the last use of the previous definition of R5. To save registers, the register allocator reuses R5 to hold the result of I3, which causes the 4C crosstalk between I2 and I3. R6 is not used in this piece of code, so we assume that R6 is available. If the register allocator uses R6 instead of R5 to hold the result of I3, the crosstalk is eliminated (shown in Fig. 1(b)). However, if we use register renaming to rename R5 to R6, the 4C crosstalk between I2 and I3 is eliminated, but new 3C crosstalk occurs between I1 and I2 and between I2 and I3 (see Fig. 1(c)). This example shows the potentially better capability of register allocation in crosstalk reduction, compared to register renaming and other software techniques.
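The effect of the register choice can be sketched numerically. The snippet below counts, per crosstalk class, the wires of a small bus that switch between two consecutive words, and then picks the candidate register whose field encoding causes the fewest 4C events. It is only a sketch: the 4-bit field, the register numbers, the helper names (class_histogram, pick_register) and the treatment of the bus edges as quiet neighbors are all assumptions, and the values are not the encodings of Fig. 1.

```python
from collections import Counter


def class_histogram(prev_word, next_word, width):
    """Count switching wires per crosstalk class between two bus words.

    Assumes the three-wire model (class = |dv - dl| + |dv - dr|) and
    treats the missing neighbour of an edge wire as quiet.
    """
    deltas = [((next_word >> i) & 1) - ((prev_word >> i) & 1)
              for i in range(width)]
    hist = Counter()
    for i, dv in enumerate(deltas):
        if dv == 0:
            continue  # a quiet wire suffers no delay class
        left = deltas[i + 1] if i + 1 < width else 0
        right = deltas[i - 1] if i > 0 else 0
        hist[abs(dv - left) + abs(dv - right)] += 1
    return hist


def pick_register(prev_field, candidates, width=4):
    """Return the candidate with the fewest 4C, then fewest 3C, events."""
    def cost(reg):
        hist = class_histogram(prev_field, reg, width)
        return (hist[4], hist[3])
    return min(candidates, key=cost)


# Hypothetical case: the previous instruction holds register 10 (0b1010)
# in this field; registers 5 and 6 are both free for the next instruction.
with_r5 = class_histogram(0b1010, 0b0101, 4)
with_r6 = class_histogram(0b1010, 0b0110, 4)
chosen = pick_register(0b1010, [5, 6])
```

In this toy case register 5 produces two 4C wires while register 6 produces none, so a crosstalk-aware allocator would prefer register 6 whenever it is legally available.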

Crosstalk aware register allocation

Optimization process outline
To optimize a whole program effectively, including code such as system libraries, our optimization process takes the disassembled code and the profiling results as inputs. The NAIG constructor then builds the NAIG from the disassembled code and sets its edge weights. Finally, the NARR processor analyzes the NAIG to reallocate the registers and generate the optimized code. The outline of the process is presented in Fig. 2.

Figure 2: Outline of the optimization process
From this outline, we can see that the core of the optimization consists of the NAIG construction and the NARR process. The details are presented in the following two subsections.

NAIG construction
The goal of this work is to reduce the crosstalk on the instruction data bus. Therefore, the more frequently a pair of registers is accessed nearby, the more important that pair is. To capture the nearby access frequency together with register allocation, we extend the interference graph widely used for register allocation and construct a new nearby access aware interference graph, called NAIG. NAIG is a weighted undirected graph that can be represented by a four-tuple G=(V, EI, EN, WE), where v∈V represents a variable or constant of the program, e(u, v)∈EI expresses that node u and node v cannot share the same register, e′(u′, v′)∈EN expresses that node u′ and node v′ may be accessed nearby, and the weight w(e′)∈WE represents the frequency of the access pattern e′(u′, v′). To build the NAIG, we take the disassembled code as input and assume an unlimited number of registers, in the same way as the virtual registers used in many compilers.
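The four-tuple can be held in a small container such as the one sketched below. The class name NAIG matches the text, but the field names, the use of frozenset pairs for undirected edges, and the two helper methods are assumptions chosen for illustration rather than the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NAIG:
    """Sketch of G = (V, EI, EN, WE) with undirected edges as frozensets."""
    nodes: set = field(default_factory=set)          # V
    interference: set = field(default_factory=set)   # EI
    nearby: dict = field(default_factory=dict)       # EN mapped to weights WE

    def add_interference(self, u, v):
        # u and v are live at the same time and cannot share a register.
        self.nodes.update((u, v))
        self.interference.add(frozenset((u, v)))

    def add_nearby(self, u, v, frequency):
        # u and v are accessed nearby; accumulate the profiled frequency.
        self.nodes.update((u, v))
        edge = frozenset((u, v))
        self.nearby[edge] = self.nearby.get(edge, 0) + frequency


graph = NAIG()
graph.add_interference("v1", "v2")
graph.add_nearby("v2", "v3", 120)
graph.add_nearby("v3", "v2", 30)   # same undirected edge, weight accumulates
```

Keeping EN as a dictionary keyed by the edge makes the later step of sorting nearby edges by weight a single call to sorted.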
First, we convert the disassembled code of each basic block to SSA form, which ensures that every register is defined only once (Algorithm 1, lines 1-4). Then, we construct the data flow of each basic block and obtain the lifetime of each register at each instruction (lines 5-6). The interference graph is constructed by analyzing the live registers at each instruction (lines 7-15). After obtaining the interference graph, we use the profiling results to add the edge weights (lines 16-21). Finally, we return the constructed NAIG (line 22). The detailed construction algorithm is given in Algorithm 1. In this algorithm, CFG is the control flow graph of the program, and each node v ∈ V′ represents a basic block that contains a number of instructions executed in order. Liveregi denotes the registers that are defined before instruction i and used after it, that is, the live registers.
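The construction steps for a single basic block can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: instructions are modeled as hypothetical (dest, sources) tuples already in SSA form, interference edges come from a standard backward liveness pass, and "nearby access" is assumed to mean operands that occupy the same operand slot in two consecutive instructions, weighted by the profiled execution count of the block.

```python
from collections import defaultdict


def build_naig(block, exec_count, live_out=frozenset()):
    """Return (interference_edges, nearby_weights) for one basic block.

    block      : list of (dest, [sources]) tuples in SSA form (assumed shape)
    exec_count : profiled execution count of the block
    live_out   : names still live after the block
    """
    interference = set()
    live = set(live_out)
    # Backward pass: a definition interferes with everything live after it.
    for dest, sources in reversed(block):
        for other in live:
            if other != dest:
                interference.add(frozenset((dest, other)))
        live.discard(dest)
        live.update(sources)

    nearby = defaultdict(int)
    # Operands in the same slot of consecutive instructions share bus wires.
    for (d1, s1), (d2, s2) in zip(block, block[1:]):
        for a, b in zip([d1, *s1], [d2, *s2]):
            if a != b:
                nearby[frozenset((a, b))] += exec_count
    return interference, dict(nearby)


# Toy block: a and b are defined, c = op(a, b), d = op(c).
toy_block = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"])]
toy_interference, toy_nearby = build_naig(toy_block, exec_count=10)
```

In the toy block only a and b are live at the same time, so they form the single interference edge, while c and d receive a nearby edge and remain free to share a register. A whole-program version would additionally propagate liveness across the CFG.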
Reducing the cost of spilled nodes is very complex, because inserting extra spill code changes the original instruction order and forces the NAIG to be rebuilt. Fortunately, our algorithm can avoid generating spill code: the input code has already been allocated successfully, so we can always eliminate a spill by assigning the spilled node its original register. The detailed reallocation algorithm is presented in the next section. Figs. 3(a) and 3(b) show the SSA representation and the corresponding NAIG, respectively, for the first three instructions of Fig. 1(a).

NARR algorithm
Based on the above NAIG, we implement our new NARR algorithm as follows. First, we construct the NAIG for each function of the program (Fig. 3); then, we sort the edges of the NAIG in decreasing order of weight (Algorithm 2, line 1). Since a heavily weighted edge indicates that its two nodes are frequently accessed nearby on the instruction data bus, we want them to be placed in the same register or in the registers with the least crosstalk. At the same time, we want as little spill code as possible, because spill code not only degrades the performance of the system but also introduces crosstalk that this algorithm cannot detect; we therefore make sure that the algorithm introduces no new spill code. Next, we process the ordered edges one by one to complete the register allocation for each node (lines 2-36). For each edge e(u, v)∈EI, we first check whether its nodes are assigned. If only one node, say u, is assigned to a register ri, we choose the register other than ri that has minimal crosstalk with it and assign it to v (lines 5-7). If both nodes are assigned to the same register, we try to move one of them to another register (lines 8-14). If neither node is assigned, we first assign one of them to a register and then find a suitable register for the other one as in the first case (lines 15-19). For an edge not in EI, we first try to assign the two nodes to the same register. If this is not possible, we handle it like an edge in EI (lines 22-34). The details are shown in Algorithm 2.

Input: the NAIG(V, EI, EN, WE) for each function of the program; the available registers R={r0, r1, …, rn}
Output: the allocation map M for each node v∈V in the NAIG
1: E' := sort EN in decreasing order of WE
2: while E' is not empty do
3:   e(u,v) := pop the first element of E'
4:   if e(u,v)∈EI then
5:     if only one node (assume u) is assigned to a register ri then
6:       get the register rj with minimal cost crosstalk(ri, rj)
7:       M.add(v, rj)
8:     else if both u, v are assigned to the same register ri then
9:       if one of the two nodes (assume u) can be changed to another register set R' without violating the IG of the current analysis then
10:        rj := rk where … and satisfy …
11:        M(u) := rk
12:      else
13:        assign the two nodes to their original registers
14:      end if
15:    else if neither u nor v is assigned to any register then
16:      ri := get a random register that node v can use
17:      M.add(v, ri)
18:      get the register rj with minimal cost crosstalk(ri, rj)
19:      M.add(u, rj)
20:    end if
21:  else
22:    if only one node (assume u) is assigned to a register ri then
23:      if ri does not violate the conflicts of the other assignments so far then
24:        M.add(v, ri)
25:      else
26:        assign as in lines 5-7
27:      end if
28:    else if neither node is assigned then
29:      if there exists a register rk that can be used for both nodes without violating the conflicts of the other assignments so far then
30:        M.add(v, rk), M.add(u, rk)
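The greedy idea behind the listing above can be sketched in a few lines of runnable code. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 2: the crosstalk cost between two registers is stood in for by the Hamming distance of their numbers, the first legal register replaces the random choice of line 16, and instead of repairing individual conflicts the sketch falls back to the original (known valid) allocation whenever the result violates an interference edge. The names narr_sketch, hamming and the parameter layout are hypothetical.

```python
def hamming(ri, rj):
    """Stand-in crosstalk cost: number of differing bits (an assumption)."""
    return bin(ri ^ rj).count("1")


def narr_sketch(nodes, interference, nearby, registers, original, cost=hamming):
    assigned = {}

    def legal(node, reg):
        # reg is legal if no already-assigned interfering neighbour holds it.
        return all(assigned.get(other) != reg
                   for edge in interference if node in edge
                   for other in edge if other != node)

    def place(node, partner_reg):
        options = [r for r in registers if legal(node, r)]
        if not options:
            assigned[node] = original[node]      # never create a new spill
        elif partner_reg is None:
            assigned[node] = options[0]
        else:
            # Same register costs 0, so sharing is preferred when legal.
            assigned[node] = min(options, key=lambda r: cost(partner_reg, r))

    # Heaviest nearby edges first; ties broken by name for determinism.
    for edge in sorted(nearby, key=lambda e: (-nearby[e], sorted(e))):
        u, v = sorted(edge)
        for node, partner in ((u, v), (v, u)):
            if node not in assigned:
                place(node, assigned.get(partner))
    for node in sorted(nodes):                   # nodes without nearby edges
        if node not in assigned:
            place(node, None)

    valid = all(len({assigned[n] for n in edge}) == 2 for edge in interference)
    return assigned if valid else dict(original)


def E(a, b):
    return frozenset((a, b))


result = narr_sketch(
    nodes={"a", "b", "c", "d"},
    interference={E("a", "b")},
    nearby={E("c", "d"): 30, E("a", "c"): 20, E("a", "b"): 10, E("b", "c"): 10},
    registers=[0, 1, 2, 3],
    original={"a": 2, "b": 3, "c": 1, "d": 2},
)
```

In this toy run the heaviest edge (c, d) is handled first and both nodes share a register, while the interfering pair a and b ends up in two registers that differ in a single bit. Because legality is checked before every assignment, the case of lines 8-14 (two interfering nodes already holding the same register) does not arise in this sketch.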

Experimental setup for performance evaluation
The experiment is set up on Fedora 12 combined with Windows 7 Home Basic. The test cases are selected from MiBench [Guthaus, Ringenberg, Ernst et al. (2001)]. The compiler is arm-linux-gcc 4.4.3, combined with objdump 2.19.51 to obtain the disassembled code. The sim-profile tool for ARM is used as the profiling tool to obtain the access frequency of instructions. The whole experimental framework is shown in Fig. 4. First, we use arm-linux-gcc to compile the source code to binary code in the Fedora 12 environment. Then, we disassemble the binary code and collect its profile information. With the disassembled code and the profile information as inputs, the NARR processor generates the crosstalk aware optimized binary code. Finally, we compare the original binary code with the optimized binary code to evaluate the performance of NARR, and analyze the improvement in crosstalk reduction in detail, especially for the 3C and 4C classes.
From the benchmark results, we can see that the 4C crosstalk is significantly reduced. In cases such as stringsearch_large, stringsearch_small, dijkstra, and crc, the reduction of 4C crosstalk is higher than 95%, eliminating almost all 4C crosstalk in the program. The average reduction is about 81%. For 3C+4C crosstalk, most benchmarks also show a significant reduction, except dijkstra, which has many conflicts between 3C and 4C crosstalk. Because we give priority to avoiding 4C crosstalk, the dijkstra result is not as good in the 3C+4C condition. Nevertheless, we still obtain an excellent reduction rate for 3C+4C on most benchmarks, and the average reduction rate over all tested benchmarks is about 44%.

Figure 5: 4C and 3C+4C crosstalk reduction in crosstalk number

To better understand crosstalk avoidance at the instruction level, we also analyze the 4C and 3C+4C crosstalk in dynamic execution using the recorded profile; the results are shown in Fig. 6. From this result, we can see that the 4C crosstalk again shows a good reduction, with an average decrease of 80.87%. The highest reduction rate is 99.99%, for the crc test, in which only two 4C crosstalk events remain in the program after optimization (shown in Tab. 2). For 3C+4C crosstalk, the average reduction is 37.01%, similar to the results shown in Fig. 5. However, a smaller reduction rate is again recorded in the 3C+4C condition for the stringsearch_large, patricia and dijkstra tests. The main reason could be that a single instruction may contain several crosstalk events of the same class, such as 3C. If such an instruction executes frequently, the crosstalk figures at the instruction level will be lower than those at the crosstalk-number level. Nevertheless, collecting crosstalk statistics at the instruction level is reasonable, since the program is executed at the instruction level and possible attackers might also work at the instruction level to obtain the most detailed information about the system.

Figure 6: 4C and 3C+4C crosstalk reduction in executed instructions

Tab. 3 shows the evaluation results of applying NARR to reduce the 4C and 3C+4C crosstalk with respect to all executed instructions. We can see that after NARR, the crosstalk percentage is significantly reduced for almost every benchmark tested, in both the 4C and 3C+4C cases, in comparison with GCC. The average percentage of 4C crosstalk is reduced to 0.89%, compared with 9.89% for the code initially compiled with GCC (a relative reduction of 91% based on the GCC value). Furthermore, for specific tests such as crc and dijkstra, we obtain nearly zero 4C crosstalk after NARR. The 3C+4C crosstalk is also reduced, from 40.77% for GCC to 25.85% after NARR on average. The NARR method is therefore effective in reducing crosstalk, especially in the 4C case. Tab. 4 shows the percentage by which NARR decreases all types of crosstalk compared to GCC. We can see that the overall crosstalk is also decreased considerably. For bitcnts and qsort, the reduction rate is more than 44%. The average reduction rate reaches 24.24%. So our NARR method can also achieve good performance in reducing crosstalk overall.
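The relative reduction quoted for Tab. 3 can be checked with a short calculation. The sketch below uses only the average percentages stated in the text (9.89% and 0.89% for 4C, 40.77% and 25.85% for 3C+4C); the helper name relative_reduction is an assumption.

```python
def relative_reduction(before, after):
    """Relative reduction in percent, based on the 'before' value."""
    return (before - after) / before * 100


# Average share of executed instructions with 4C crosstalk: GCC vs. NARR.
four_c = relative_reduction(9.89, 0.89)
# Average share with 3C+4C crosstalk: GCC vs. NARR.
three_four_c = relative_reduction(40.77, 25.85)

print(f"4C: {four_c:.1f}%  3C+4C: {three_four_c:.1f}%")
```

The first value reproduces the 91% stated in the text. The second is a ratio of averages, so it need not coincide exactly with an average of per-benchmark reduction rates.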