ABSTRACT
A wide range of real-world applications, including DSP, deep learning, multimedia, and scientific algorithms, rely heavily on fixed-point and floating-point arithmetic operations and trigonometric functions, which have long latency and high power usage. In this paper, we propose a computation reuse mechanism for multi-core processors that reuses the result of an arithmetic operation for subsequent operations with (approximately) the same operands. The mechanism adds a small result cache to every functional unit; the cache keeps a few recent operands and their results so that repeated operands can be detected and the stored results reused. To take advantage of the value locality inherent in many real-world applications, our architecture relies on a multi-stage interconnection network that distributes input data elements across the cores of a multi-core processor in such a way that the data locality of each core is increased. As a result, each core achieves a higher computation reuse rate, which translates into a larger reduction in power consumption. Experimental results show that the proposed mechanism increases the result cache hit rate, which leads to a significant reduction in the power consumption of arithmetic operations.
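The reuse idea in the abstract can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the paper's hardware design: the class `ResultCache`, the `ignored_low_bits` knob used for approximate matching, the LRU replacement policy, and the value-range partitioning that stands in for the multi-stage interconnection network are all hypothetical choices made for this example.

```python
from collections import OrderedDict


class ResultCache:
    """Tiny LRU table of recent (operands -> result) pairs for one functional unit.

    Hypothetical model: entry count, LRU policy and bit masking are assumptions.
    """

    def __init__(self, entries=4, ignored_low_bits=0):
        self.entries = entries
        self.ignored_low_bits = ignored_low_bits
        self.table = OrderedDict()
        self.hits = 0
        self.lookups = 0

    def _key(self, a, b):
        # Drop low-order bits so nearly equal integer operands share one entry.
        s = self.ignored_low_bits
        return (a >> s, b >> s)

    def multiply(self, a, b):
        self.lookups += 1
        key = self._key(a, b)
        if key in self.table:
            # Hit: reuse the stored result instead of computing it again.
            self.hits += 1
            self.table.move_to_end(key)
            return self.table[key]
        result = a * b
        self.table[key] = result
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # evict the least recently used entry
        return result


def run(streams, coeff=3, entries=2):
    """Give each core its own cache and stream; return the overall hit rate."""
    caches = [ResultCache(entries) for _ in streams]
    for cache, stream in zip(caches, streams):
        for x in stream:
            cache.multiply(x, coeff)
    hits = sum(c.hits for c in caches)
    lookups = sum(c.lookups for c in caches)
    return hits / lookups


data = list(range(1, 8)) * 4  # 28 inputs drawn from 7 distinct values
num_cores = 4

# Baseline: hand out inputs to cores in arrival order.
round_robin = [data[i::num_cores] for i in range(num_cores)]

# Locality-aware stand-in for the interconnection network:
# each core receives a narrow range of values.
by_value = [[x for x in data if (x - 1) // 2 == c] for c in range(num_cores)]

print(run(round_robin))  # 0.0
print(run(by_value))     # 0.75
```

In this toy input, round-robin distribution makes every core cycle through all seven values, so a two-entry cache never hits, while grouping similar values on the same core lets three of every four multiplications be served from the cache. Setting `ignored_low_bits` above zero additionally lets operands that differ only in their low-order bits reuse a stored result, at the cost of an approximate answer.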
Index Terms
- Low-power Parallel Data Processing Using Computation Reuse