
Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs

  • Conference paper
  • Part of Euro-Par 2021: Parallel Processing (Euro-Par 2021)

Abstract

We demonstrate the performance benefits and tradeoffs of fusing two convolution layers into a single layer on multicore CPUs in deep learning pipelines. We analyze when and why fusion may yield runtime speedups, and study three types of layer fusion: (a) a 3-by-3 depthwise convolution with a 1-by-1 convolution, (b) a 3-by-3 convolution with a 1-by-1 convolution, and (c) two 3-by-3 convolutions. We show that whether fusion is beneficial depends on numerous factors, including arithmetic intensity, machine balance, memory footprint, memory access pattern, and how the output tensor is tiled. We devise a schedule for all three fusion types that automatically generates fused kernels for multicore CPUs through auto-tuning. On more than 30 layers extracted from five CNNs, in standalone kernel benchmarks we achieve a 1.04x geomean speedup (1.44x max) against separate kernels from MKLDNN, and a 1.24x geomean speedup (2.73x max) against AutoTVM-tuned separate kernels. In end-to-end inference tests, we show a 1.09x geomean speedup (1.29x max) against TVM, and a 2.09x geomean speedup (3.35x max) against MKLDNN-backed PyTorch.
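As a rough illustration of the arithmetic-intensity argument above, the sketch below estimates how fusing a 3-by-3 depthwise convolution with a 1-by-1 convolution can raise arithmetic intensity by keeping the intermediate tensor in cache. The layer shape, the float32 data type, and the assumption that fusion elides exactly one store and one load of the intermediate tensor are illustrative choices, not the paper's actual cost model:

```python
# Back-of-the-envelope estimate of arithmetic intensity (AI = FLOPs / bytes
# of memory traffic) for a MobileNet-style layer pair, run separately vs.
# fused. Halo (boundary) effects and weight reuse details are ignored.

def conv_stats(H, W, C_in, C_out, k, depthwise=False, dtype_bytes=4):
    """Return (FLOPs, memory traffic in bytes) for one convolution layer."""
    if depthwise:
        C_out = C_in                               # one filter per channel
        flops = 2 * H * W * C_in * k * k
        weights = C_in * k * k
    else:
        flops = 2 * H * W * C_in * C_out * k * k
        weights = C_in * C_out * k * k
    # Traffic: read input, write output, read weights once.
    mem = (H * W * C_in + H * W * C_out + weights) * dtype_bytes
    return flops, mem

# Hypothetical 56x56x128 layer pair: 3x3 depthwise, then 1x1 pointwise.
f_dw, m_dw = conv_stats(56, 56, 128, 128, 3, depthwise=True)
f_pw, m_pw = conv_stats(56, 56, 128, 128, 1)

separate_ai = (f_dw + f_pw) / (m_dw + m_pw)

# Fused: the 56x56x128 intermediate stays in cache, so its store (layer 1
# output) and reload (layer 2 input) are both elided from memory traffic.
inter_bytes = 2 * 56 * 56 * 128 * 4
fused_ai = (f_dw + f_pw) / (m_dw + m_pw - inter_bytes)

print(f"separate AI: {separate_ai:.2f} FLOPs/byte")
print(f"fused AI:    {fused_ai:.2f} FLOPs/byte")
```

Under these assumptions the fused pair roughly doubles the arithmetic intensity, which is why fusion helps most when the separate kernels sit on the memory-bound side of the roofline; when a layer is already compute-bound, the elided traffic buys little.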

Supported by Intel Labs.



Author information

Corresponding author

Correspondence to Zhongyi Lin.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Lin, Z., Georganas, E., Owens, J.D. (2021). Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021: Parallel Processing. Lecture Notes in Computer Science, vol. 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_15


  • DOI: https://doi.org/10.1007/978-3-030-85665-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85664-9

  • Online ISBN: 978-3-030-85665-6

  • eBook Packages: Computer Science (R0)
