ABSTRACT
Processing-in-memory (PIM) and in-storage-computing (ISC) architectures implement computation inside memory and near storage, respectively. While both effectively mitigate the overhead of moving data from memory and storage to the processor, the limited bandwidth of existing systems means they still suffer from large data movement overhead between storage and memory, especially when the amount of required data is large. This overhead has become a major constraint on further improving computation efficiency in PIM and ISC architectures.
In this paper, we propose ParaBit, a scheme that enables Parallel Bitwise operations in NAND flash storage, where the data reside. By adjusting the latch control circuitry and the sequence of sensing operations, ParaBit performs in-flash bitwise operations with little or no extra hardware, effectively reducing the data movement overhead between storage and memory. We further exploit the massive parallelism in NAND-flash-based SSDs to mitigate the long latency of flash operations. Our experimental results show that the proposed ParaBit design achieves significant performance improvements over state-of-the-art PIM and ISC architectures.
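The core idea of accumulating sense results in the page-buffer latch can be illustrated with a minimal Python model. This is a hypothetical functional sketch, not the actual ParaBit circuit: it assumes pages are bit vectors and that the latch, when not reset between consecutive sense operations, can only be pulled toward one value, so sensing two wordlines back-to-back yields their bitwise AND (latch preset to 1) or OR (latch preset to 0).

```python
# Functional model of bulk bitwise operations via sequential sensing.
# Assumption (hypothetical): the sense latch is preconditioned once and
# then accumulates across sense operations instead of being reset.

def sense_and(page_a, page_b):
    """Latch preset to 1; each sense can only pull a bit to 0 -> AND."""
    latch = [1] * len(page_a)
    for page in (page_a, page_b):
        latch = [l & b for l, b in zip(latch, page)]
    return latch

def sense_or(page_a, page_b):
    """Latch preset to 0; each sense can only set a bit to 1 -> OR."""
    latch = [0] * len(page_a)
    for page in (page_a, page_b):
        latch = [l | b for l, b in zip(latch, page)]
    return latch

a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
print(sense_and(a, b))  # [1, 0, 0, 1]
print(sense_or(a, b))   # [1, 1, 1, 1]
```

Because the operands never leave the flash array in this model, the only data crossing the storage-to-memory interface is the final result page, which is the source of the data movement savings the abstract describes.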