Abstract
In this work, we present a computationally efficient Reinforcement Learning (RL) search heuristic for finding high-quality mappings of N-dimensional convolution loops. The heuristic uses an inexpensive reward function, based on the potential data reuse of operands, to guide the search. We also present an RL state representation that generalizes to N-dimensional convolution loops, together with a state-parsing strategy that ensures only valid mappings are evaluated for quality. Our RL search heuristic applies to multi-core systems with a memory hierarchy. For a range of 3D convolution layers, and at significantly lower computational expense than random search, it generally yields mappings with a lower Energy-Delay Product (EDP) on an architecture with multiple processing elements sharing a memory connected to DRAM. Across 19 3D convolution layers, our RL method required on average only 11.24% of the operations of Timeloop's random search to assess the same number of valid mappings, and the mappings found by Timeloop had on average 12.51% higher EDP than the lowest-EDP mapping found by our RL method. Further, the lowest-EDP mappings found by our method were on average only 4.69× above the theoretical lower-bound EDP, and only 1.29× above it in the best case.
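The abstract names three computations that a short sketch can make concrete: the validity check applied when parsing a state into a mapping, a cheap reward based on potential operand reuse, and the EDP metric used to rank mappings. The sketch below is a minimal illustration under our own assumptions, not the paper's implementation; `is_valid`, `MappingStats`, `reuse_reward`, and `edp` are hypothetical names, and the actual reward and parsing logic are defined in the paper itself.

```python
# Hypothetical sketch of the quantities named in the abstract.
# None of these names or formulas come from the paper itself.
from dataclasses import dataclass
from math import prod

def is_valid(tile_factors: list[int], loop_extent: int) -> bool:
    # Parsing-time validity check: the tiling factors assigned across the
    # memory levels must exactly cover the loop extent (assumption:
    # validity means a perfect factorization of each loop dimension).
    return prod(tile_factors) == loop_extent

@dataclass
class MappingStats:
    # Per-mapping statistics as a cost model (e.g., Timeloop) might report them.
    energy_pj: float    # total energy, picojoules
    cycles: int         # total latency, cycles
    macs: int           # multiply-accumulate operations performed
    dram_accesses: int  # operand words moved to/from DRAM

def reuse_reward(s: MappingStats) -> float:
    # Cheap reward proxy: operations per DRAM word moved. Higher operand
    # reuse means fewer DRAM accesses per MAC, hence a larger reward.
    return s.macs / max(s.dram_accesses, 1)

def edp(s: MappingStats) -> float:
    # Energy-Delay Product: the final quality metric for a mapping.
    return s.energy_pj * s.cycles

# Usage: the search explores only valid factorizations, steers toward
# high-reuse states, and reports the lowest-EDP mapping found.
assert is_valid([4, 8, 7], 224) and not is_valid([4, 8, 8], 224)
a = MappingStats(2.0e6, 1_000_000, 5_000_000, 40_000)
b = MappingStats(1.5e6, 1_400_000, 5_000_000, 25_000)
print(reuse_reward(a), reuse_reward(b), edp(min((a, b), key=edp)))
```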