Abstract
To implement machine learning applications in real-time safety-critical systems, we previously introduced a predictable framework named ACETONE. This framework compiles the detailed description of an offline-trained feed-forward deep neural network into semantically equivalent C code. In this paper, we improve the performance of the generated C code by adding gemm-based convolutions to ACETONE. The code incorporating the gemm routines retains ACETONE's properties of semantics preservation and timing predictability. We compare the proposed method with ACETONE's initial version, Keras2c and uTVM on a realistic set of machine learning benchmarks, and show that the introduced convolution algorithms allow a trade-off between performance and memory footprint.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Notes
An additional parameter, the dilation, is supported by the code generation but not detailed here.
References
Abadi M, Agarwal A, Barham P, et al (2015) TensorFlow: large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/, software available from tensorflow.org
Alves E, Bhatt D, Hall B, et al (2018) Considerations in assuring safety of increasingly autonomous systems. NASA
Amiri H, Shahbahrami A (2017) High performance implementation of 2D convolution using Intel’s advanced vector extensions. In: 2017 Artificial intelligence and signal processing conference (AISP), pp 25–30, https://doi.org/10.1109/AISP.2017.8324097
Anderson A, Vasudevan A, Keane C, et al (2017) Low-memory GEMM-based convolution algorithms for deep neural networks. https://doi.org/10.48550/arXiv.1709.03395, arXiv:1709.03395 [cs]
ApacheTVM (2021) microTVM: TVM on bare-metal. URL https://tvm.apache.org/docs/topic/microtvm/index.html
Ballabriga C, Cassé H, Rochange C, et al (2010) OTAWA: an open toolbox for adaptive WCET analysis (regular paper). In: IFIP Workshop on software technologies for future embedded and ubiquitous systems (SEUS)
Bhattacharyya S, Cofer D, Musliner D, et al (2015) Certification considerations for adaptive systems. 2015 International conference on unmanned aircraft systems, ICUAS 2015 pp 270–279. https://doi.org/10.1109/ICUAS.2015.7152300
Chellapilla K, Puri S, Simard P (2006) High performance convolutional neural networks for document processing. In: Lorette G (ed) Tenth international workshop on frontiers in handwriting recognition, Université de Rennes 1. Suvisoft, La Baule (France), URL https://hal.inria.fr/inria-00112631, http://www.suvisoft.com
Chen T, Moreau T, Jiang Z, et al (2018a) TVM: end-to-end optimization stack for deep learning. CoRR abs/1802.04799
Chen T, Zheng L, Yan E, et al (2018b) Learning to optimize tensor programs. In: Proceedings of the 32nd international conference on neural information processing systems. Curran Associates Inc., Red Hook, NY, USA, NIPS’18, p 3393-3404
Chetlur S, Woolley C, Vandermersch P, et al (2014) cuDNN: efficient primitives for deep learning. CoRR abs/1410.0759
Chichin S, Portes D, Blunder M, et al (2020) Capability to embed deep neural networks: study on CPU processor in avionics context. In: 10th European congress embedded real time systems (ERTS 2020)
Cong J, Xiao B (2014) Minimizing computation in convolutional neural networks. In: Wermter S, Weber C, Duch W et al (eds) Artificial neural networks and machine learning - ICANN 2014. Springer, Cham, pp 281–290
Conlin R, Erickson K, Abbate J et al (2021) Keras2c: a library for converting Keras neural networks to real-time compatible C. Eng Appl Artif Intell 100:104182
ONNX Runtime developers (2021) ONNX Runtime. URL https://onnxruntime.ai/
Dongarra JJ, Du Croz J, Hammarling S et al (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17. https://doi.org/10.1145/77626.79170
Dukhan M (2019) The indirect convolution algorithm. CoRR abs/1907.02129
EUROCAE WG-114/SAE joint group (2021) Certification/approval of aeronautical systems based on AI. Ongoing standardization
Gholami A, Kim S, Dong Z, et al (2021) A survey of quantization methods for efficient neural network inference. CoRR abs/2103.13630
Gong Y, Liu L, Yang M, et al (2014) Compressing deep convolutional networks using vector quantization. CoRR abs/1412.6115
Goto K, van de Geijn RA (2008) Anatomy of high-performance matrix multiplication. ACM Trans Math Softw 34(3):1–25. https://doi.org/10.1145/1356052.1356053
Han S, Mao H, Dally WJ (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: Bengio Y, LeCun Y (eds) 4th International conference on learning representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, conference track proceedings, arXiv:1510.00149
Hoseinzade E, Haratizadeh S (2019) CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Syst Appl 129:273–285
IEEE (2019) IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008) pp 1–84. https://doi.org/10.1109/IEEESTD.2019.8766229
Jia Z, Padon O, Thomas J, et al (2019) TASO. In: Proceedings of the 27th ACM symposium on operating systems principles. ACM, https://doi.org/10.1145/3341301.3359630
Kalray (2021) MPPA® Coolidge™ Processor - white paper. URL https://www.kalrayinc.com/documentation/
Karmani RK, Agha G, Squillante MS et al (2011) ATLAS (Automatically tuned linear algebra software). Encyclopedia of parallel computing. Springer, New York, pp 95–101
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. Rep. 0, University of Toronto
Lattner C, Amini M, Bondhugula U, et al (2021) MLIR: scaling compiler infrastructure for domain specific computation. In: Lee JW, Soffa ML, Zaks A (eds) International symposium on code generation and optimization, (CGO), pp 2–14
Lavin A, Gray S (2016) Fast algorithms for convolutional neural networks. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 4013–4021, https://doi.org/10.1109/CVPR.2016.435
LeCun Y, Boser BE, Denker JS et al (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
Li C, Yang Y, Feng M, et al (2016) Optimizing memory efficiency for deep convolutional neural networks on GPUs. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, SC 2016
Lin S, Liu N, Nazemi M, et al (2018) FFT-based deep learning deployment in embedded systems. In: 2018 Design, automation and test in Europe conference and exhibition (DATE), pp 1045–1050, https://doi.org/10.23919/DATE.2018.8342166
Liu Y, Wang Y, Yu R, et al (2018) Optimizing CNN model inference on CPUs. https://doi.org/10.48550/ARXIV.1809.02697, arXiv:1809.02697
Low TM, Igual FD, Smith TM et al (2016) Analytical modeling is enough for high-performance BLIS. ACM Trans Math Softw 43(2):1–18. https://doi.org/10.1145/2925987
Mathieu M, Henaff M, LeCun Y (2014) Fast training of convolutional networks through FFTs. In: 2nd International conference on learning representations, ICLR 2014, April 14-16, 2014
NVIDIA (2021) TensorRT documentation
Park H, Kim D, Ahn J, et al (2016) Zero and data reuse-aware fast convolution for deep neural networks on GPU. In: Proceedings of the eleventh IEEE/ACM/IFIP international conference on hardware/software codesign and system synthesis. Association for Computing Machinery, New York, NY, USA, CODES '16, https://doi.org/10.1145/2968456.2968476
Paszke A, Gross S, Massa F, et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, et al (eds) Advances in neural information processing systems 32. p 8024–8035
Pearce H, Yang X, Roop PS et al (2020) Designing neural networks for real-time systems. IEEE Embed Syst Lett 13:1–1
Perez-Cerrolaza J, Abella J, Kosmidis L et al (2022) GPU devices for safety-critical systems: a survey. ACM Comput Surv. https://doi.org/10.1145/3549526
Pompougnac H, Beaugnon U, Cohen A, et al (2020) From SSA to synchronous concurrency and back. Research report RR-9380, INRIA Sophia Antipolis - Méditerranée (France), URL https://hal.inria.fr/hal-03043623
Pujol R, Jorba J, Tabani H, et al (2022) Vector extensions in cots processors to increase guaranteed performance in real-time systems. ACM Trans Embed Comput Syst
Ray PP (2022) A review on TinyML: state-of-the-art and prospects. J King Saud Univ Comput Inf Sci 34(4):1595–1623
RTCA/EUROCAE (2011) DO-178C/ED-12C - Software considerations in airborne systems and equipment certification
Schoeberl M, Abbaspour S, Akesson B et al (2015) T-crest: time-predictable multi-core architecture for embedded systems. J Syst Archit 61(9):449–471
Sentieys O, Filip S, Briand D, et al (2021) Adequatedl: approximating deep learning accelerators. In: 24th International symposium on design and diagnostics of electronic circuits systems (DDECS 21)
Silva IDA, Carle T, Gauffriau A, et al (2022) ACETONE: predictable programming framework for ML applications in safety-critical systems. In: 34th Euromicro conference on real-time systems, ECRTS 2022, July 5-8, 2022, Modena, Italy, pp 3:1–3:19
Stahl R (2021) μTVM StaticRT CodeGen. URL https://github.com/tum-ei-eda/utvm_staticrt_codegen
TensorFlow (2022) Simple audio recognition: recognizing keywords. URL https://www.tensorflow.org/tutorials/audio/simple_audio
Texas Instruments (2013) TCI6630K2L Multicore DSP+ARM KeyStone II System-on-Chip. Tech. Rep. SPRS893E, Texas Instruments Incorporated
The Khronos NNEF Working Group (2018) Neural network exchange format
Tollenaere N, Iooss G, Pouget S et al (2022) Autotuning convolutions is easier than you think. ACM Trans Archit Code Optim. https://doi.org/10.1145/3570641
Van Zee FG, van de Geijn RA (2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans Math Softw 41(3):1–33
Warden P (2018) Speech commands: a dataset for limited-vocabulary speech recognition. CoRR abs/1804.03209
Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimizations of software and the ATLAS project. Parallel Comput 27(1–2):3–35. https://doi.org/10.1016/s0167-8191(00)00087-9
Wilhelm R, Engblom J, Ermedahl A et al (2008) The worst-case execution-time problem-overview of methods and survey of tools. ACM Trans Embed Comput Syst 7:1–53
Xianyi Z, Qian W, Saar W (2011) OpenBLAS: an optimized BLAS library. URL https://www.openblas.net/
Zhang J, Franchetti F, Low TM (2018) High performance zero-memory overhead direct convolutions. In: Dy J, Krause A (eds) Proceedings of the 35th international conference on machine learning, pp 5776–5785, URL https://proceedings.mlr.press/v80/zhang18d.html
Zheng L, Jia C, Sun M, et al (2020) Ansor: generating high-performance tensor programs for deep learning. https://doi.org/10.48550/ARXIV.2006.06762, arXiv:2006.06762
Funding
This work has benefited from the AI Interdisciplinary Institute ANITI, which is funded by the French “Investing for the Future – PIA3” program under the Grant agreement ANR-19-P3IA-0004.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
De Albuquerque Silva, I., Carle, T., Gauffriau, A. et al. Extending a predictable machine learning framework with efficient gemm-based convolution routines. Real-Time Syst 59, 408–437 (2023). https://doi.org/10.1007/s11241-023-09407-z