
Computationally Efficient DNN Mapping Search Heuristic using Deep Reinforcement Learning

Published: 09 September 2023

Abstract

In this work, we present a computationally efficient Reinforcement Learning (RL) search heuristic for finding high-quality mappings of N-dimensional convolution loops. The heuristic uses a computationally inexpensive reward function, based on the potential data reuse of operands, to guide the search. We also present an RL state representation that generalizes to N-dimensional convolution loops, together with a state-parsing strategy that ensures only valid mappings are evaluated for quality. Our search heuristic is applicable to multi-core systems with a memory hierarchy. For a range of 3D convolution layers, and at significantly lower computational expense than random search, our heuristic generally yields mappings with a lower Energy-Delay Product (EDP) on an architecture with multiple processing elements sharing a memory connected to DRAM. Evaluated across 19 3D convolution layers, our RL method performed on average only 11.24% of the operations of Timeloop’s random search while assessing the same number of valid mappings. The mappings found by Timeloop had, on average, 12.51% higher EDP than the lowest-EDP mappings found by our RL method. Furthermore, the lowest-EDP mappings found by our method were on average only 4.69× above the theoretical lower-bound EDP, and only 1.29× above it in the best case.
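To make the cost metric concrete, the sketch below illustrates how an Energy-Delay Product (energy × delay) can be computed for candidate mappings and how a cheap reuse-based proxy reward might be formed. This is a minimal illustration under assumed interfaces, not the paper's implementation; the `MappingCost` fields, the `reuse_reward` definition, and the numbers are hypothetical.

```python
# Illustrative sketch only: EDP scoring and a cheap reuse-based proxy reward.
# The cost fields and example numbers below are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class MappingCost:
    energy_pj: float   # total energy of executing the mapping, in picojoules
    cycles: int        # total execution cycles (delay)

def energy_delay_product(cost: MappingCost) -> float:
    """EDP = energy x delay; lower is better."""
    return cost.energy_pj * cost.cycles

def reuse_reward(dram_accesses: int, total_operand_accesses: int) -> float:
    """Proxy reward: fraction of operand accesses served without going to DRAM,
    i.e., the potential data reuse captured by a mapping's tiling and loop order."""
    return 1.0 - dram_accesses / total_operand_accesses

# Two hypothetical mappings of the same convolution layer.
m1 = MappingCost(energy_pj=2.4e9, cycles=1_500_000)
m2 = MappingCost(energy_pj=1.9e9, cycles=1_800_000)

best = min((m1, m2), key=energy_delay_product)
print(energy_delay_product(m1), energy_delay_product(m2), best)
```

A reward of this kind is inexpensive because it only requires counting operand accesses per memory level for a candidate mapping, rather than running a full architectural cost model for every state visited during the search.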

REFERENCES

  [1] 2023. Cadence. https://www.cadence.com/en_US/home/tools/ip/tensilica-ip.html
  [2] Abadi Martín, Agarwal Ashish, Barham Paul, Brevdo Eugene, Chen Zhifeng, et al. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.
  [3] Ahn Byung Hoon, Pilligundla Prannoy, Yazdanbakhsh Amir, and Esmaeilzadeh Hadi. 2020. Chameleon: Adaptive code optimization for expedited deep neural network compilation. CoRR abs/2001.08743 (2020). https://arxiv.org/abs/2001.08743
  [4] Ali Murtaza, Stotzer Eric, Igual Francisco D., and Geijn Robert A. van de. 2012. Level-3 BLAS on the TI C6678 multi-core DSP. In 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing. 179–186.
  [5] Anon. 2021. F3 Netherlands Marine Seismic Dataset. https://terranubis.com/datainfo/F3-Demo-2020
  [6] Anon. 2023. Intel oneAPI Math Kernel Library. https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top.html
  [7] Anon. 2023. Minimal Standard Minstd_rand0 Generator. https://cplusplus.com/reference/random/minstd_rand0/
  [8] Anon. 2023. NVIDIA CUDA Basic Linear Algebra Subroutine Library. https://docs.nvidia.com/cuda/cublas/
  [9] Bakshi Suyash and Johnsson Lennart. 2020. A highly efficient SGEMM implementation using DMA on the Intel/Movidius Myriad-2. In 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’20). 321–328.
  [10] Balasubramonian Rajeev, Kahng Andrew B., Muralimanohar Naveen, Shafiee Ali, and Srinivas Vaishnav. 2017. CACTI 7: New tools for interconnect exploration in innovative off-chip memories. ACM Trans. Archit. Code Optim. 14, 2, Article 14 (June 2017), 25 pages.
  [11] Barry Brendan, Brick Cormac, Connor Fergal, Donohoe David, Moloney David, et al. 2015. Always-on vision processing unit for mobile applications. IEEE Micro 35, 2 (2015), 56–66.
  [12] Chen Tianqi and Guestrin Carlos. 2016. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
  [13] Chen Tianqi, Moreau Thierry, Jiang Ziheng, Zheng Lianmin, Yan Eddie, et al. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
  [14] Chen Yu-Hsin, Emer Joel, and Sze Vivienne. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 367–379.
  [15] Whaley R. Clint, Petitet Antoine, and Dongarra Jack J. 2001. Automated empirical optimizations of software and the ATLAS project. Parallel Comput. 27, 1 (2001), 3–35. New Trends in High Performance Computing.
  [16] Dally William J., Turakhia Yatish, and Han Song. 2020. Domain-specific hardware accelerators. Commun. ACM 63, 7 (June 2020), 48–57.
  [17] Dave Shail, Kim Youngbin, Avancha Sasikanth, Lee Kyoungwoo, and Shrivastava Aviral. 2019. DMazeRunner: Executing perfectly nested loops on dataflow accelerators. ACM Trans. Embed. Comput. Syst. 18, 5s, Article 70 (Oct. 2019), 27 pages.
  [18] Harris Charles R., Millman K. Jarrod, Walt Stéfan J. van der, Gommers Ralf, Virtanen Pauli, et al. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362.
  [19] Hegde Kartik, Agrawal Rohit, Yao Yulun, and Fletcher Christopher W. 2018. Morph: Flexible acceleration for 3D CNN-based video understanding. CoRR abs/1810.06807 (2018). http://arxiv.org/abs/1810.06807
  [20] Hegde Kartik, Tsai Po-An, Huang Sitao, Chandra Vikas, Parashar Angshuman, et al. 2021. Mind mappings: Enabling efficient algorithm-accelerator mapping space search. CoRR abs/2103.01489 (2021). https://arxiv.org/abs/2103.01489
  [21] Hobbhahn Marius. 2021. How to Measure FLOP/s for Neural Networks Empirically?
  [22] Huang Qijing, Kang Minwoo, Dinh Grace, Norell Thomas, Kalaiah Aravind, et al. 2021. CoSA: Scheduling by Constrained Optimization for Spatial Accelerators.
  [23] Jia-Wei Hong and Kung H. T. 1981. I/O complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing (STOC’81). Association for Computing Machinery, New York, NY, USA, 326–333.
  [24] Jouppi Norman P., Young Cliff, Patil Nishant, Patterson David A., Agrawal Gaurav, et al. 2017. In-datacenter performance analysis of a tensor processing unit. CoRR abs/1704.04760 (2017). http://arxiv.org/abs/1704.04760
  [25] Kao Sheng-Chun and Krishna Tushar. 2020. GAMMA: Automating the HW mapping of DNN models on accelerators via genetic algorithm. In Proceedings of the 39th International Conference on Computer-Aided Design (ICCAD’20). Association for Computing Machinery, New York, NY, USA, Article 44, 9 pages.
  [26] Kodukula Induprakas, Ahmed Nawaaz, and Pingali Keshav. 1997. Data-centric multi-level blocking. 346–357.
  [27] Krizhevsky Alex, Sutskever Ilya, and Hinton Geoffrey E. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (May 2017), 84–90.
  [28] Kurzak Jakub, Alvaro Wesley, and Dongarra Jack. 2009. Optimizing matrix multiplication for a short-vector SIMD architecture – CELL processor. Parallel Comput. 35 (March 2009), 138–150.
  [29] Kwon Hyoukjun, Chatarasi Prasanth, Pellauer Michael, Parashar Angshuman, Sarkar Vivek, et al. 2018. Understanding reuse, performance, and hardware cost of DNN dataflows: A data-centric approach using MAESTRO.
  [30] Laskin Michael, Metz Luke, Nabarro Seth, Saroufim Mark, Noune Badreddine, et al. 2020. Parallel training of deep networks with local updates.
  [31] Mei Linyan, Houshmand Pouya, Jain Vikram, Giraldo Sebastian, and Verhelst Marian. 2021. ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators. IEEE Trans. Comput. 70, 8 (2021), 1160–1174.
  [32] Mellor-Crummey John, Whalley David, and Kennedy Ken. 2001. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming 29 (Jan. 2001).
  [33] Nesterov Yurii. 2014. Introductory Lectures on Convex Optimization: A Basic Course (1st ed.). Springer Publishing Company, Incorporated.
  [34] Netzer Gilbert. 2015. Efficient LU Factorization for Texas Instruments Keystone Architecture Digital Signal Processors. Master’s Thesis. Royal Institute of Technology (KTH). http://www.diva-portal.org/smash/get/diva2:837145/FULLTEXT01
  [35] Noune Badreddine, Jones Philip, Justus Daniel, Masters Dominic, and Luschi Carlo. 2022. 8-bit Numerical Formats for Deep Neural Networks. arXiv:cs.LG/2206.02915
  [36] Parashar Angshuman, Raina Priyanka, Shao Yakun Sophia, Chen Yu-Hsin, Ying Victor A., et al. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’19). 304–315.
  [37] Schuiki Fabian, Schaffner Michael, and Benini Luca. 2019. NTX: An energy-efficient streaming accelerator for floating-point generalized reduction workloads in 22 nm FD-SOI. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE’19). 662–667.
  [38] Schulman John, Wolski Filip, Dhariwal Prafulla, Radford Alec, and Klimov Oleg. 2017. Proximal policy optimization algorithms.
  [39] Simonyan Karen and Zisserman Andrew. 2014. Very deep convolutional networks for large-scale image recognition.
  [40] Tran Du, Bourdev Lubomir, Fergus Rob, Torresani Lorenzo, and Paluri Manohar. 2014. Learning spatiotemporal features with 3D convolutional networks.
  [41] Treibig Jan, Hager Georg, and Wellein Gerhard. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In 2010 39th International Conference on Parallel Processing Workshops. 207–216.
  [42] Wang Naigang, Choi Jungwook, Brand Daniel, Chen Chia-Yu, and Gopalakrishnan Kailash. 2018. Training deep neural networks with 8-bit floating point numbers. arXiv:cs.LG/1812.08011
  [43] Wu Yannan Nellie, Emer Joel S., and Sze Vivienne. 2019. Accelergy: An architecture-level energy estimation methodology for accelerator designs. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD’19). 1–8.
  [44] Yang Xuan, Gao Mingyu, Liu Qiaoyi, Setter Jeff, Pu Jing, et al. 2020. Interstellar: Using halide’s scheduling language to analyze DNN accelerators. 369–383.
  [45] Zerrell Tim and Bruestle Jeremy. 2019. Stripe: Tensor compilation via the nested polyhedral model.
  [46] Zheng Size, Liang Yun, Wang Shuo, Chen Renze, and Sheng Kaiwen. 2020. FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’20). Association for Computing Machinery, New York, NY, USA, 859–873.
  [47] Çiçek Özgün, Abdulkadir Ahmed, Lienkamp Soeren S., Brox Thomas, and Ronneberger Olaf. 2016. 3D U-Net: Learning dense volumetric segmentation from sparse annotation.


• Published in

  ACM Transactions on Embedded Computing Systems, Volume 22, Issue 5s
  Special Issue ESWEEK 2023
  October 2023, 1394 pages
  ISSN: 1539-9087
  EISSN: 1558-3465
  DOI: 10.1145/3614235
  • Editor: Tulika Mitra

  Publisher

  Association for Computing Machinery
  New York, NY, United States

  Publication History

  • Published: 9 September 2023
  • Accepted: 13 July 2023
  • Revised: 1 June 2023
  • Received: 16 March 2023

  Published in TECS Volume 22, Issue 5s
