Retraction Notice: Method for Ultra-precision FPU Integration based on Fine-Grained Control

Bentham Open Disclaimer: It is a condition of publication that manuscripts submitted to this journal have not been published and will not be simultaneously submitted or published elsewhere. Furthermore, any data, illustration, structure or table that has been published elsewhere must be reported, and copyright permission for reproduction must be obtained. Plagiarism is strictly forbidden, and by submitting the article for publication the authors agree that the publishers have the legal right to take appropriate action against the authors, if plagiarism or fabricated information is discovered.


INTRODUCTION
Various soft computing solutions can be illustrated with particle swarm optimization (PSO) and artificial neural network (ANN) models [1][2][3][4], which are computationally time-consuming or may need parameter estimation [5,6]. In addition to model simulation, scientific and real-life applications also place critical requirements on the floating-point performance and data accuracy of embedded processors [7]. Although the vast majority of processors now integrate a double-precision hardware floating-point unit (FPU) to improve floating-point performance and data accuracy [8][9][10][11], this can hardly satisfy actual applications.
Ultra-precision, an industrial standard developed by Intel Corporation, denotes floating-point data precision exceeding double precision and can meet these data accuracy requirements. However, ultra-precision computing is currently achieved in software in embedded fields [12], which dramatically reduces overall processor performance [13,14]. As a result, ultra-precision FPU integration in Reduced Instruction Set Computer (RISC) processors is an important ongoing research topic in processor design.
Ultra-precision FPU integration is complicated because the pipeline state of the processor must be taken into consideration. Several published methods for FPU integration cannot be applied to ultra-precision FPU integration and have relatively low efficiency because they decouple the communication between FPU and processor, so software intervention is needed in floating-point operation. Previous work on FPU integration is described in detail in Section II.
To solve the problem above, a fine-grained integration method for ultra-precision FPU, based on centralized control and data segmentation, is proposed. The method fully considers the pipeline state of the processor and makes FPU and processor tightly coupled, which is implemented by dedicated hardware modules. Meanwhile, it regards the execution status of floating-point instructions as the basic granularity to implement precise control of the FPU and to reduce design complexity. Compared with studies published elsewhere, the main contributions of this paper are as follows:
(1) For the first time, this paper discloses a method for integrating an ultra-precision FPU into pipelined RISC processors with no need to change the existing microprocessor modules. Based on the proposed method, an 80-bit FPU has been integrated into the Scalable Processor Architecture version 8 (SPARC V8) processor.
(2) The FPU execution efficiency of the proposed approach is very high, as it is implemented in hardware and no software intervention is needed in floating-point operation.
This paper is organized as follows. In Section II, related published work is introduced. In Section III, the ultra-precision FPU integration method based on centralized control and data segmentation is proposed and applied to embed an ultra-precision FPU (80 bits, Meiko interface, compatible with the Intel floating-point coprocessor) into the five-stage pipeline SPARC V8 LEON2 processor [15]. In Section IV, implementation results of the proposed method are contrasted with several published mechanisms. Finally, in Section V, the conclusions are presented.

RELATED WORK
FPU integration methods have been described in a number of publications. Schwarz and Trong [16,17] introduce the implementation of a high-precision FPU, and Yong embeds an 80-bit FPU into the X86 processor using micro-instruction code stored in ROM [18]. Fetching the micro-instruction code consumes processor execution time, which reduces floating-point efficiency. Moreover, because of the difference in processor architecture, micro-instruction code cannot be applied to pipelined RISC processors.
Joven, Gajjar and Du [14,19,20] loosen the coupling between FPU and processor: the FPU serves as a slave unit on the on-chip bus, and its calculation process is controlled by st/ld instructions. These schemes need software intervention in floating-point operation and increase access conflicts on the on-chip bus, which makes FPU efficiency extremely low. Although some measures, such as a dedicated FPU data bus and a more effective interaction between processor and FPU, are taken to improve calculation efficiency, the result is still not ideal.
Brunelli [21] integrates a reconfigurable FPU into the main processor core through a universal I/O interface containing data and control buses. This approach is easy to implement but does not consider the processor pipeline status. IBM developed a dedicated interface for FPU integration [22,23], namely the auxiliary processor unit (APU) interface. The APU connects to the processor instruction pipeline and can negotiate the transfer of particular instructions and data to the FPU. IBM's solution is very efficient but unsuitable for ultra-precision FPU integration. The proposed ultra-precision FPU integration method fully considers the instruction pipeline state of the processor and makes FPU and processor tightly coupled, so software intervention is not needed and communication overheads between FPU and processor can be ignored. At the same time, the design based on fine-grained control simplifies the implementation of ultra-precision FPU integration. Together, these achieve a significant improvement in floating-point calculation efficiency and play an important role in reducing hardware overheads.

CONTROL ALGORITHM AND ITS IMPLEMENTATION
The implementation of the proposed method only needs to add FPU control logic in different pipeline stages, without changing the remaining processor modules. The control algorithm and its implementation are presented using a five-stage pipeline RISC processor [24] as an example, in which fine-grained centralized control is implemented in the instruction decode (ID) stage, whereas data segmentation involves the execution (EX), memory access (MA) and write back (WB) stages.

Principles of Fine-grained Control
Floating-point instructions perform the conversion and operation of floating-point data. Their classification is shown in Table 1. In an instruction name, the precision types of the source and destination operands follow the letters S and D respectively, and each is one of integer (I), single precision (S), double precision (D) and ultra-precision (U). For example, SDDS refers to instructions whose source operand is double precision and whose destination operand is single precision. Depending on the precision, the 15 types of floating-point instruction are divided into three classes: S, D and U (indicated in Table 1 with green, red and blue), whose operand widths are 32, 64 and 80 bits respectively. The proposed method further subdivides the execution status of D and U instructions. Using the different states, a wide operand is written to the narrow floating-point register file through the pipeline.
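The classification above can be illustrated with a minimal sketch. Several points are assumptions rather than statements from the text: the class of an instruction is taken to be set by the wider of its two operands (this reproduces the five instructions said to enter D.S1), the integer operand is taken to be 32 bits wide, and the integer-to-integer pair is taken to be the one combination excluded from the 4 x 4 grid. The names WIDTH, PAIRS, CLASSES, instruction_name and instruction_class are hypothetical.

```python
from itertools import product

# Operand widths in bits. S, D and U come from the text; the 32-bit integer
# width is an assumption.
WIDTH = {"I": 32, "S": 32, "D": 64, "U": 80}


def instruction_name(src: str, dst: str) -> str:
    """Build a name such as 'SDDS': S<source precision>D<destination precision>."""
    return f"S{src}D{dst}"


def instruction_class(src: str, dst: str) -> str:
    """Classify by the wider of the two operands (assumed rule)."""
    widest = max(WIDTH[src], WIDTH[dst])
    return {32: "S", 64: "D", 80: "U"}[widest]


# All source/destination pairs except integer-to-integer, which is assumed
# to be the excluded combination because it involves no floating-point data.
PAIRS = [(s, d) for s, d in product("ISDU", repeat=2) if (s, d) != ("I", "I")]

# Group the instruction names by class.
CLASSES = {}
for s, d in PAIRS:
    CLASSES.setdefault(instruction_class(s, d), []).append(instruction_name(s, d))

for cls in "SDU":
    print(cls, sorted(CLASSES[cls]))
```

Under these assumptions the sketch yields 3 S-class, 5 D-class and 7 U-class instructions, 15 in total, which agrees with the count given in the text.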
The fine-grained control is implemented by a state machine that analyzes floating-point instructions. As shown in Fig. (1), there are four states (X.S0, D.S1, U.S1 and U.S2) for the three kinds of instruction, S, D and U. X.S0 is shared by all three kinds, and in X.S0 the three categories of instruction are differentiated. If an S-type instruction or a control hazard exists in X.S0, the state machine remains unchanged. If there is a D or U instruction in X.S0, the state machine moves to D.S1 or U.S1 respectively when pipeline enable is active (hold=1). In states other than X.S0, PC update is prohibited to prevent new instructions from entering the ID stage. Taking the state of the state machine shown in Fig. (1) as the fundamental granularity, the control algorithm generates the FPU control information and transfers the control information needed by the destination operand (write enable and write address) to the next pipeline stage. Taking D.S1 as an example, five kinds of instruction can enter this state, as shown in Table 1, and of these only SDDI and SDDS have a destination operand narrower than the source operand. According to rule (2), which uses the high-priority state to write back the least data, write enable is set valid (rd_wen='1') and the correct write-back address (pipeline.rd) is given in the D.S1 state.
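The state transitions described above can be sketched as a small software model. This is an illustration only, not the hardware implementation: the return from D.S1 and U.S2 to X.S0 is an assumption inferred from the state sequences described for Fig. (5), the pipeline enable condition is modelled as a plain boolean, and the names State, next_state, pc_update_allowed and trace are hypothetical.

```python
from enum import Enum


class State(Enum):
    X_S0 = "X.S0"
    D_S1 = "D.S1"
    U_S1 = "U.S1"
    U_S2 = "U.S2"


def next_state(state, instr_class, pipeline_enabled=True, control_hazard=False):
    """Return the next ID-stage state.

    instr_class is 'S', 'D' or 'U' for the instruction currently in ID.
    """
    if not pipeline_enabled:
        return state  # pipeline not enabled: hold the current state (assumed)
    if state is State.X_S0:
        if control_hazard or instr_class == "S":
            return State.X_S0  # S instructions and hazards stay in X.S0
        return State.D_S1 if instr_class == "D" else State.U_S1
    if state is State.U_S1:
        return State.U_S2
    # D.S1 and U.S2 are the last state of their class; returning to X.S0
    # is an assumption, not stated explicitly in the text.
    return State.X_S0


def pc_update_allowed(state):
    """PC update is prohibited outside X.S0 so no new instruction enters ID."""
    return state is State.X_S0


def trace(instr_class):
    """States visited by one instruction, starting from X.S0."""
    states = [State.X_S0]
    while True:
        nxt = next_state(states[-1], instr_class)
        if nxt is State.X_S0:
            return [s.value for s in states]
        states.append(nxt)
```

For example, trace("U") visits X.S0, U.S1 and U.S2, matching the three ID-stage states described for an instruction with an 80-bit source operand.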

Segmentation of Data
As soon as the FPU completes its calculation, the EX, MA and WB stages of the pipeline register the FPU output in segments and then move to the next stage. Finally, the registered data are written back to the floating-point register file through the pipeline. The segmented processing algorithm is further described in Fig. (3). First, the algorithm checks whether the FPU output is normal. If the output is abnormal, write-back of the FPU output is cancelled and exception information is submitted to the exception handling module. Otherwise, the FPU output is registered in segments according to the instruction type and state information (pipeline.state), which corresponds one to one with the control information (write enable and write address) generated in Fig. (2b). Taking U instructions as an example, and following rule (2) (use the high-priority state to write back the least data of the FPU output), EX registers the 32-bit FPU output to the pipeline register in the "10" state for SUDI and SUDS, which have only a 32-bit destination operand. For SUDD, whose valid output is 64 bits, EX registers the most significant 32 bits of the FPU output to the pipeline register in the "10" state, while MA registers the least significant 32 bits in the "01" state. For the other U instructions, which have an 80-bit destination operand, EX holds the least significant 16 bits of the 80-bit FPU output in the "10" state, MA registers the middle 32 bits, and WB stores the most significant 32 bits into the pipeline registers in the "00" state at the same time.
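The segmentation of U-instruction results described above can be sketched as follows. The bit slices follow the text literally; the "01" label for the MA segment of the 80-bit case and the right alignment of the valid result within the FPU output are assumptions. The functions segment_output and reassemble are hypothetical helpers used only for illustration.

```python
def segment_output(fpu_out: int, dest_width: int) -> dict:
    """Map pipeline stage -> (state label, registered bits) for one FPU result.

    The valid result is assumed to be right-aligned in fpu_out.
    """
    if dest_width == 32:  # SUDI, SUDS
        return {"EX": ("10", fpu_out & 0xFFFFFFFF)}
    if dest_width == 64:  # SUDD
        return {
            "EX": ("10", (fpu_out >> 32) & 0xFFFFFFFF),  # most significant 32 bits
            "MA": ("01", fpu_out & 0xFFFFFFFF),          # least significant 32 bits
        }
    if dest_width == 80:  # remaining U instructions
        return {
            "EX": ("10", fpu_out & 0xFFFF),              # least significant 16 bits
            "MA": ("01", (fpu_out >> 16) & 0xFFFFFFFF),  # middle 32 bits, label assumed
            "WB": ("00", (fpu_out >> 48) & 0xFFFFFFFF),  # most significant 32 bits
        }
    raise ValueError("dest_width must be 32, 64 or 80")


def reassemble(segments: dict, dest_width: int) -> int:
    """Rebuild the destination operand from its registered segments."""
    if dest_width == 32:
        return segments["EX"][1]
    if dest_width == 64:
        return (segments["EX"][1] << 32) | segments["MA"][1]
    return (segments["WB"][1] << 48) | (segments["MA"][1] << 16) | segments["EX"][1]


if __name__ == "__main__":
    result = 0x0123456789ABCDEF0011  # arbitrary 80-bit pattern
    for stage, (state, bits) in segment_output(result, 80).items():
        print(stage, state, hex(bits))
```

Reassembling the segments returns the original value, which is a quick check that the three slices cover all 80 bits exactly once.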

Implementation of FPU Integration Method
SPARC V8 is the only architecture that defines quadruple-precision (ultra-precision) instructions and is fully open and non-proprietary [25]. Other RISC architectures, which achieve ultra-precision floating-point operation through software, do not have ultra-precision floating-point instructions and require authorization for industrial application. Given all that, SPARC V8 allows a thorough evaluation of our method with the help of open-source implementations and an open-source simulator, and is chosen to further illustrate the implementation of the FPU integration method in a RISC pipeline processor.

RESULTS OF IMPLEMENTATION AND TEST
Many verification methodologies can be used to verify the proposed method, and the main difference among them for users is the programming language. In this case, the correctness and timing of the proposed method have been verified with the e Reuse Methodology (eRM), which is licensed by Cadence. Checked against the Berkeley TestFloat suite, the floating-point results of the processor are correct. The typical timing diagram of the proposed method is shown in Fig. (5), where (a) is for SUDU floating-point instructions (both source and destination operands are 80-bit ultra-precision) and (b) is for SSDU floating-point instructions (the source operand is single precision and the destination operand is ultra-precision). In Fig. (5a), the ID stage takes the three states X.S0, U.S1 and U.S2 to prepare the 80-bit source operand and starts the FPU operation in U.S2. WB writes the destination operand using the states "00", "01" and "10". In Fig. (5b), the ID stage prepares the 32-bit source operand and starts the FPU operation in X.S0, and subsequently generates the write enable and write address needed by the destination operand in the corresponding U.S1 and U.S2 states. When the FPU calculation is finished, the pipeline stages (EX, MA, WB) register the 80-bit FPU output in segments according to the information stored in the pipeline registers, and finally the WB stage writes the destination operand to the floating-point register file.
To evaluate this method carefully, some assumptions are made. (1) The FPU design and timing constraints of the other benchmarks are the same as those of the design that employs the proposed method. (2) No FPU exception occurs while the floating-point efficiency is evaluated. Differences in timing constraints may cause a 20% deviation from the results at the typical corner, whereas the design and exceptions of the FPU significantly affect the floating-point efficiency results.
The comparison in this section is divided into three parts: critical path delay, hardware overhead and floating-point efficiency. However, the evaluation of the integration method involves the design and implementation of the floating-point unit (FPU), which significantly affects the comparison results. Therefore, only designs with the same form of ultra-precision FPU, based on the Gaisler Research intellectual property core [26], are adopted as benchmarks for a fair and thorough comparison [14,18,19].
The analysis of critical path delay is presented first. In the delay model, the critical path propagation delay is calculated from the delay at the typical corner (2.5 V, typical process, room temperature) and the derating factors of process, voltage and temperature. The propagation delay varies greatly with these derating factors, as depicted in Fig. (6). The derating factors, defined as important parameters in the delay model, express the influence of process, voltage and temperature on propagation delay. The voltage and temperature derating factors (VDF and TDF) are chosen as the x and y coordinate axes, and the z axis indicates propagation delay. The three parametric surfaces correspond to the slow, typical and fast process derating factors. The derating factor at the typical corner (KV, KC), indicated with '1', is constant and adopted as the benchmark. As temperature rises and applied voltage falls, the delay increases; the fast process derating factor gives the minimal delay, whereas the slow one gives the maximal delay. Although all of these factors affect the delay, the process derating factor has the largest effect. For the same typical process factor, VDF and TDF cause almost 18% variation in the propagation delay. For the same VDF and TDF, the relative propagation delay is 0.804, 1 and 1.15 for the fast, typical and slow process factors respectively.
The typical corner is the standard application environment and is adopted as the benchmark for synthesis, at which the critical path delay is only 3.7 ns based on the TSMC 0.25 um library. The main reason is that the proposed method regards the execution status of instructions as the basic granularity for generating the FPU control information, which significantly reduces the complexity of the control logic. Compared with the scheme based on micro-instruction code, the delay decreases by 37.3% at the typical corner [18]. From Fig. (6), a similar conclusion can be drawn for other VDF and TDF values. The proposed method can therefore lead to a significant reduction in critical path delay.
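The effect of the derating factors on the typical-corner delay can be sketched numerically. The text does not give the formula of the delay model, so a multiplicative model is assumed here. The relative delays 0.804 and 1.15 are assigned to the fast and slow process corners respectively, consistent with the statement that the fast process has the minimal delay. The names PROCESS_FACTOR and derated_delay are hypothetical.

```python
TYPICAL_DELAY_NS = 3.7  # critical path delay at the typical corner, from the text

# Relative delay per process corner at fixed VDF and TDF. Mapping 0.804 to
# the fast corner and 1.15 to the slow corner is an interpretation.
PROCESS_FACTOR = {"fast": 0.804, "typical": 1.0, "slow": 1.15}


def derated_delay(process, vdf=1.0, tdf=1.0, typical_delay_ns=TYPICAL_DELAY_NS):
    """Scale the typical-corner delay by process, voltage and temperature factors.

    The multiplicative form is an assumption; vdf and tdf equal 1.0 at the
    typical corner.
    """
    return typical_delay_ns * PROCESS_FACTOR[process] * vdf * tdf


if __name__ == "__main__":
    for corner in ("fast", "typical", "slow"):
        print(corner, round(derated_delay(corner), 3), "ns")
```

Under this assumed model, the 3.7 ns typical-corner delay scales to roughly 2.97 ns at the fast process corner and 4.26 ns at the slow one when VDF and TDF are held at 1.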
Next, the hardware overhead is evaluated. The synthesis results, obtained using the same FPGA device as the Cortex-M1 integration, are shown in Table 2. The design uses 3585 LUTs and 1594 flip-flops, a decrease of 16.9% compared with Cortex-M1 [14]. The hardware overhead is lower because destination operand write-back through the pipeline is adopted, which reuses many hardware resources. However, the LUT and register consumption of the proposed method increases by 9.6% and 9% respectively compared with GRFPU-Lite [19]. The main reason is that GRFPU-Lite only supports integration of single and double precision FPUs, whereas the proposed method is suitable for single, double and ultra-precision FPUs.
Finally, we evaluate the floating-point efficiency of the proposed method and compare the results with mechanisms published elsewhere [14,17,26]. The results are shown in Fig. (7). The LEON3 FPU takes 173 clock cycles to finish single and double precision floating-point operations, because it is a slave unit on the on-chip bus, a form of loose coupling, and software intervention is needed in the operation.
The GRFPU and GRFPU-Lite of Gaisler Research take close to 30 clock cycles to complete single and double precision operations. The improvement in efficiency is the result of embedding the FPU into the processor core and implementing tight coupling between processor and FPU.
The Cortex-M1 needs about 15~30 clocks to finish the execution of floating-point instructions by optimizing the interaction between FPU and processor. The proposed method, however, generates the FPU control information based on execution status, which advances the execution information to the next pipeline stage on each clock, and embeds the FPU into the processor core in hardware. Thus, communication overheads between FPU and processor can be ignored. As a result of all these factors, the proposed method needs just 9~10 clocks to finish single and double precision floating-point instructions. The floating-point calculation efficiency increases 1.7 times compared with Cortex-M1.
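As a rough check of the reported factor, the cycle counts above can be compared directly. This is only one possible reading and not a calculation given by the text: it assumes both designs run at the same clock frequency, so that efficiency is inversely proportional to cycle count, and it pairs the lower bounds and the upper bounds of the two ranges. The names speedup, CORTEX_M1_CYCLES and PROPOSED_CYCLES are hypothetical.

```python
def speedup(baseline_cycles: float, proposed_cycles: float) -> float:
    """Ratio of cycle counts, assuming equal clock frequency for both designs."""
    return baseline_cycles / proposed_cycles


CORTEX_M1_CYCLES = (15, 30)  # range reported in the text
PROPOSED_CYCLES = (9, 10)    # range reported in the text

# Pair the lower bounds and the upper bounds of the two ranges.
low_pair = speedup(CORTEX_M1_CYCLES[0], PROPOSED_CYCLES[0])
high_pair = speedup(CORTEX_M1_CYCLES[1], PROPOSED_CYCLES[1])

print(f"lower bounds: {low_pair:.2f}x, upper bounds: {high_pair:.2f}x")
```

Under this reading, the ratio of the lower bounds (15 to 9 cycles) is consistent with the reported factor of 1.7, and the ratio of the upper bounds is larger.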
In mainstream embedded processors, ultra-precision floating-point operation is handled by software emulation of floating-point computing, which takes thousands of clock cycles. Fig. (7c) gives the ultra-precision floating-point clock overheads of the V8 processor based on the proposed method; the efficiency is 20~100 times higher than software ultra-precision floating-point emulation [14,19].

CONCLUSION
This paper proposes a fine-grained integration method for ultra-precision FPU, based on centralized control and data segmentation. The method generates the FPU control information corresponding to the execution status and writes destination operands through the pipeline, which allows an 80-bit FPU to be integrated into a pipelined processor. The SPARC V8 processor with an 80-bit FPU based on the proposed mechanism has been implemented, verified and analyzed. The results show that the critical path delay of floating-point instructions decreases by 37.3%, hardware consumption declines by 16.9% and the floating-point calculation efficiency increases 1.7 times. This method can be used to embed an ultra-high precision FPU into RISC processors. Nevertheless, the integration method has limitations. Efficient floating-point exception handling, the structure of the register file and implementation in multi-core processors have not been considered here, and the assumptions used in this study are themselves limiting. Improvement of the FPU integration method based on the proposed mechanism is therefore in progress.

Fig. (2). Fine-grained control mechanism based on execution status. The control algorithm for the destination operand is shown in Fig. (2b), where pipeline denotes the data structure of the pipeline registers. Two rules are defined in this method: (1) the priority of the S2, S1 and S0 states decreases in turn; (2) use the high-priority state to write the least data of the FPU output. The algorithm compares the widths of the source and destination operands in X.S0, D.S1, U.S1 and U.S2, then, based on the comparison result, generates the write enable and write address of the destination operand needed for FPU output write-back, and transfers the control information, write enable and write address to the next pipeline stage.

Fig. (7). Efficiency comparison of floating-point computation: (a) the number of clock cycles needed for single precision; (b) the number of clock cycles needed for double precision; (c) the number of clock cycles needed for ultra-precision with the proposed method.

Table 1. Precision type combination of floating-point operand. Destination and source precision types: Integer Type, Single Precision, Double Precision, Ultra-Precision.
The control algorithm for the source operand is shown in Fig. (2a), where regfile[rs] refers to the data in the floating-point register file specified by the source address (rs). In the X.S0 state, the control algorithm does not pass regfile[rs] to the least significant 32 bits of the FPU input (fpui.rs) until all hazards disappear. In S1 (D.S1 or U.S1), regfile[rs+1] is assigned to the middle 32 bits of fpui.rs, while in S2 (U.S2), regfile[rs+2] is connected to the most significant 16 bits of fpui.rs. Once the source operand is ready according to the source operand precision, the FPU operation is started (fpui.Start = '1').
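The assembly of the source operand from 32-bit register words can be sketched as below. It is an illustration under assumptions: the register file is modelled as a list of 32-bit integers, the low 16 bits of regfile[rs+2] are taken as the ones that feed the top of fpui.rs, and hazards are ignored. The function assemble_source is a hypothetical helper.

```python
def assemble_source(regfile, rs, precision):
    """Build the FPU source operand (fpui.rs) from 32-bit register words.

    precision is 'S' (32 bits), 'D' (64 bits) or 'U' (80 bits).
    """
    words = {"S": 1, "D": 2, "U": 3}[precision]
    # X.S0: regfile[rs] supplies the least significant 32 bits.
    value = regfile[rs] & 0xFFFFFFFF
    if words >= 2:
        # S1 (D.S1 or U.S1): regfile[rs+1] supplies the middle 32 bits.
        value |= (regfile[rs + 1] & 0xFFFFFFFF) << 32
    if words == 3:
        # S2 (U.S2): regfile[rs+2] supplies the most significant 16 bits.
        # Using its low 16 bits is an assumption.
        value |= (regfile[rs + 2] & 0xFFFF) << 64
    return value


if __name__ == "__main__":
    regfile = [0xDDDDDDDD, 0xCCCCCCCC, 0x0000BBBB]
    print(hex(assemble_source(regfile, 0, "U")))
```

One register word is consumed per ID-stage state, so a single precision operand is ready after X.S0, a double precision operand after D.S1 and an ultra-precision operand after U.S2.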