Abstract
In this work, we present a computationally efficient Reinforcement Learning (RL) search heuristic for finding high-quality mappings of N-dimensional convolution loops. The heuristic uses an inexpensive reward function, based on the potential data reuse of operands, to guide the search. We also present an RL state representation that generalizes to N-dimensional convolution loops, together with a state-parsing strategy that ensures only valid mappings are evaluated for quality. Our RL search heuristic applies to multi-core systems with a memory hierarchy. For a range of 3D convolution layers, and at significantly lower computational expense than random search, it generally yields mappings with a lower Energy-Delay Product (EDP) on an architecture with multiple processing elements sharing a memory connected to DRAM. Across 19 3D convolution layers, our RL method required on average only 11.24% of the operations of Timeloop's random search to assess the same number of valid mappings, and the mappings found by Timeloop had on average 12.51% higher EDP than the lowest-EDP mapping found by our RL method. Further, the lowest-EDP mappings found by our method were on average only 4.69× above the theoretical lower-bound EDP, and only 1.29× above it in the best case.
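The abstract names three computations that a short sketch can make concrete: the validity check applied when parsing a state into a mapping, a cheap reward based on potential operand reuse, and the EDP metric used to rank mappings. The sketch below is a minimal illustration under our own assumptions, not the paper's implementation; `is_valid`, `MappingStats`, `reuse_reward`, and `edp` are hypothetical names, and the actual reward and parsing logic are defined in the paper itself.

```python
# Hypothetical sketch of the quantities named in the abstract.
# None of these names or formulas come from the paper itself.
from dataclasses import dataclass
from math import prod

def is_valid(tile_factors: list[int], loop_extent: int) -> bool:
    # Parsing-time validity check: the tiling factors assigned across the
    # memory levels must exactly cover the loop extent (assumption:
    # validity means a perfect factorization of each loop dimension).
    return prod(tile_factors) == loop_extent

@dataclass
class MappingStats:
    # Per-mapping statistics as a cost model (e.g., Timeloop) might report them.
    energy_pj: float    # total energy, picojoules
    cycles: int         # total latency, cycles
    macs: int           # multiply-accumulate operations performed
    dram_accesses: int  # operand words moved to/from DRAM

def reuse_reward(s: MappingStats) -> float:
    # Cheap reward proxy: operations per DRAM word moved. Higher operand
    # reuse means fewer DRAM accesses per MAC, hence a larger reward.
    return s.macs / max(s.dram_accesses, 1)

def edp(s: MappingStats) -> float:
    # Energy-Delay Product: the final quality metric for a mapping.
    return s.energy_pj * s.cycles

# Usage: the search explores only valid factorizations, steers toward
# high-reuse states, and reports the lowest-EDP mapping found.
assert is_valid([4, 8, 7], 224) and not is_valid([4, 8, 8], 224)
a = MappingStats(2.0e6, 1_000_000, 5_000_000, 40_000)
b = MappingStats(1.5e6, 1_400_000, 5_000_000, 25_000)
print(reuse_reward(a), reuse_reward(b), edp(min((a, b), key=edp)))
```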