Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs

Lin, Zhongyi; Georganas, Evangelos; Owens, John D.

doi:10.1007/978-3-030-85665-6_15

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12820)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

We demonstrate the performance benefits and tradeoffs of combining two convolution layers in a deep learning pipeline into a single layer on multicore CPUs. We analyze when and why fusion may result in runtime speedups, and study three types of layer fusion: (a) 3-by-3 depthwise convolution with 1-by-1 convolution, (b) 3-by-3 convolution with 1-by-1 convolution, and (c) two 3-by-3 convolutions. We show that whether fusion is beneficial depends on numerous factors, including arithmetic intensity, machine balance, memory footprint, memory access pattern, and the way the output tensor is tiled. We devise a schedule that covers all three fusion types and automatically generates fused kernels for multicore CPUs through auto-tuning. On more than 30 layers extracted from five CNNs, standalone kernel benchmarks show a 1.04x geomean speedup (1.44x max) over separate kernels from MKLDNN, and a 1.24x geomean speedup (2.73x max) over AutoTVM-tuned separate kernels. End-to-end inference tests show a 1.09x geomean speedup (1.29x max) over TVM, and a 2.09x geomean speedup (3.35x max) over MKLDNN-backed PyTorch.
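To make the role of arithmetic intensity and machine balance concrete, the following minimal sketch estimates the arithmetic intensity (FLOPs per byte of memory traffic) of fusion type (a), a 3-by-3 depthwise convolution followed by a 1-by-1 convolution, with and without fusion. It is an illustrative roofline-style model, not the authors' schedule or cost model. The names `ConvPair`, `flops`, `arithmetic_intensity`, and `is_memory_bound` are hypothetical, and the model assumes float32 data, stride 1 with "same" padding, each tensor moved to or from memory exactly once, and an idealized fused kernel that keeps the intermediate tensor entirely in cache.

```python
from dataclasses import dataclass

BYTES_PER_ELEM = 4  # assumption: float32 tensors


@dataclass
class ConvPair:
    """Hypothetical description of a 3x3 depthwise conv followed by a 1x1 conv."""
    h: int      # spatial height (unchanged by both layers under our assumptions)
    w: int      # spatial width
    c_in: int   # channels entering (and leaving) the depthwise layer
    c_out: int  # channels produced by the 1x1 layer


def flops(p: ConvPair) -> int:
    """Total floating-point operations, counting one multiply-add as 2 FLOPs."""
    depthwise = 2 * p.h * p.w * p.c_in * 9        # 3x3 window per channel
    pointwise = 2 * p.h * p.w * p.c_in * p.c_out  # 1x1 conv across channels
    return depthwise + pointwise


def arithmetic_intensity(p: ConvPair, fused: bool) -> float:
    """FLOPs per byte of memory traffic under the idealized model above."""
    input_elems = p.h * p.w * p.c_in
    mid_elems = p.h * p.w * p.c_in    # output of the depthwise layer
    output_elems = p.h * p.w * p.c_out
    weight_elems = 9 * p.c_in + p.c_in * p.c_out

    traffic = input_elems + output_elems + weight_elems
    if not fused:
        # Separate kernels write the intermediate tensor, then read it back.
        traffic += 2 * mid_elems
    return flops(p) / (traffic * BYTES_PER_ELEM)


def is_memory_bound(intensity: float, machine_balance: float) -> bool:
    """A kernel is memory-bound when its intensity is below the machine balance
    (peak FLOP/s divided by peak memory bandwidth in bytes/s)."""
    return intensity < machine_balance


if __name__ == "__main__":
    pair = ConvPair(h=56, w=56, c_in=128, c_out=128)  # illustrative shape only
    balance = 20.0  # hypothetical machine balance in FLOPs per byte
    for fused in (False, True):
        ai = arithmetic_intensity(pair, fused)
        print(f"fused={fused}: AI={ai:.2f}, memory-bound={is_memory_bound(ai, balance)}")
```

Under these assumptions fusion leaves the FLOP count unchanged and only removes the intermediate tensor's round trip to memory, so the gain is largest when the unfused kernels sit below the machine balance. The model ignores the other factors the abstract lists (memory footprint, access pattern, output tiling), which can shrink or reverse the predicted benefit.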

Supported by Intel Labs.

Author information

Authors and Affiliations

University of California, Davis, USA
Zhongyi Lin & John D. Owens
Parallel Computing Laboratory, Intel Corporation, Santa Clara, USA
Evangelos Georganas

Authors

Zhongyi Lin
Evangelos Georganas
John D. Owens

Corresponding author

Correspondence to Zhongyi Lin.

Editor information

Editors and Affiliations

Universidade de Lisboa, Lisbon, Portugal
Leonel Sousa
Universidade de Lisboa, Lisbon, Portugal
Nuno Roma
Universidade de Lisboa, Lisbon, Portugal
Pedro Tomás

About this paper

Cite this paper

Lin, Z., Georganas, E., Owens, J.D. (2021). Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_15

DOI: https://doi.org/10.1007/978-3-030-85665-6_15
Published: 25 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85664-9
Online ISBN: 978-3-030-85665-6
eBook Packages: Computer Science, Computer Science (R0)
