Abstract
With technology scaling, on-chip power dissipation and off-chip memory bandwidth have become significant performance bottlenecks in virtually all computer systems, from mobile devices to supercomputers. An effective way of improving performance in the face of bandwidth and power limitations is to rely on associative memory systems. Recent work on a PCM-based, associative TCAM accelerator shows that associative search capability can reduce both off-chip bandwidth demand and overall system energy. Unfortunately, previously proposed resistive TCAM accelerators have limited flexibility: only a restricted (albeit important) class of applications can benefit from a TCAM accelerator, and the implementation is confined to resistive memory technologies with a high dynamic range (RHigh/RLow), such as PCM.
This work proposes AC-DIMM, a flexible, high-performance associative compute engine built on a DDR3-compatible memory module. AC-DIMM addresses the limited flexibility of previous resistive TCAM accelerators by combining two powerful capabilities: associative search and processing in memory. Generality is improved by augmenting a TCAM system with a set of integrated, user-programmable microcontrollers that operate directly on search results, and by architecting the system such that key-value pairs can be co-located in the same TCAM row. A new, bit-serial TCAM array is proposed, which enables the system to be implemented using STT-MRAM. AC-DIMM achieves a 4.2X speedup and a 6.5X energy reduction over a conventional RAM-based system on a set of 13 evaluated applications.
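The combination of associative search and in-memory processing described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the AC-DIMM design: the names `Row`, `ternary_search`, and `reduce_matches` are hypothetical, the row layout is invented for illustration, and the sequential loop stands in for what a TCAM array would evaluate across all rows in hardware.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Row:
    """Hypothetical TCAM row holding a key and its value side by side."""
    key: int    # stored key bits
    care: int   # 1 = bit must match, 0 = stored "don't care" bit
    value: int  # value co-located with the key in the same row


def ternary_search(rows: List[Row], query: int, query_mask: int) -> List[int]:
    """Return indices of rows matching the query on every bit that
    both the row and the query mask mark as significant."""
    hits = []
    for i, row in enumerate(rows):
        active = row.care & query_mask
        if (row.key ^ query) & active == 0:
            hits.append(i)
    return hits


def reduce_matches(rows: List[Row], hits: List[int],
                   op: Callable[[int, int], int], init: int) -> int:
    """Stand-in for a programmable controller that post-processes
    the values of matching rows without shipping them to the CPU."""
    acc = init
    for i in hits:
        acc = op(acc, rows[i].value)
    return acc


rows = [
    Row(key=0b1010, care=0b1111, value=7),
    Row(key=0b1000, care=0b1100, value=5),  # low two bits are don't-care
    Row(key=0b0110, care=0b1111, value=9),
]
hits = ternary_search(rows, query=0b1010, query_mask=0b1111)
total = reduce_matches(rows, hits, lambda a, b: a + b, 0)
print(hits, total)
```

In this model the first two rows match the query (the second through its don't-care bits), and only the reduced result, rather than the matching rows themselves, would need to cross the memory bus.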
AC-DIMM: associative computing with STT-MRAM
ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture