Abstract
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption.
Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49x and 5x, and efficiency by up to 28x and 5x, respectively.
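The row-buffer argument above can be illustrated with a toy model. The following is a minimal sketch, not the paper's methodology: it assumes a single DRAM bank with an open-row policy, and the row size (`ROW_BYTES`), access granularity (`ACCESS_BYTES`), and all function names are hypothetical values chosen for illustration. It counts how many row activations a sequential stream and a randomly ordered stream need to touch the same set of addresses.

```python
import random

# Illustrative assumptions, not figures from the paper.
ROW_BYTES = 2048     # assumed DRAM row-buffer size
ACCESS_BYTES = 64    # assumed size of each memory access


def count_row_activations(addresses, row_bytes=ROW_BYTES):
    """Count activations for one bank that keeps the last-used row open."""
    activations = 0
    open_row = None
    for addr in addresses:
        row = addr // row_bytes
        if row != open_row:
            # A different row must be activated before it can be read.
            activations += 1
            open_row = row
    return activations


def sequential_stream(n, access_bytes=ACCESS_BYTES):
    """Addresses of n back-to-back accesses starting at address 0."""
    return [i * access_bytes for i in range(n)]


def random_stream(n, access_bytes=ACCESS_BYTES, seed=0):
    """The same n addresses as sequential_stream, visited in shuffled order."""
    addrs = sequential_stream(n, access_bytes)
    random.Random(seed).shuffle(addrs)
    return addrs


N = 4096
seq_acts = count_row_activations(sequential_stream(N))
rnd_acts = count_row_activations(random_stream(N))
print(f"sequential: {seq_acts} activations, random: {rnd_acts} activations")
```

Under these assumptions the sequential stream activates each row exactly once, while the shuffled stream reopens rows repeatedly to read the same data. This is the effect that motivates favoring algorithms with sequential accesses.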
The Mondrian Data Engine
ISCA '17: Proceedings of the 44th Annual International Symposium on Computer Architecture