
Deep Learning Inferencing with High-performance Hardware Accelerators

Published: 15 June 2023

Abstract

As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, many apps that are not well represented by these standards require different workloads, such as ML models and datasets, to achieve similar goals. Additionally, many apps, like real-time video processing, are focused on the latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Because these models are composed of fundamental neural network operations yet are architecturally different from each other, they stress devices in distinct, insightful ways from which generalizations about the performance of other models can be drawn. Many devices featuring ML-specific hardware and optimizations are analyzed, including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, ML-oriented hardware added to the Intel Xeon CPUs boosts throughput by 3.7× and reduces latency by up to 34.7×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices were limited in throughput due to large data transfer times and were not competitive in latency. The FPGA frameworks showcase the lowest latency, with the Xilinx Alveo U200 FPGA achieving 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis-AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs keep critical model data closer to processing elements for lower latency. The massively parallel and high-memory GPU devices with Tensor Core accelerators achieve the best throughput, with the NVIDIA Tesla A100 GPU showcasing the highest rates at 42,513 and 52,484 images/second for AlexNet and GoogLeNet, respectively.
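
To make the two headline metrics concrete, the following is a minimal sketch of how streaming latency (batch size 1, per-image wall time) and maximum throughput (large batches, images/second) are typically measured for inference. It is not the authors' benchmarking harness: the `infer` function, the toy weight matrix, and all shapes are hypothetical stand-ins for a compiled CNN such as AlexNet or GoogLeNet running on a target device.

```python
# Minimal sketch of latency vs. throughput measurement (illustrative only;
# not the paper's harness). `infer` is a hypothetical stand-in for one
# forward pass of a compiled CNN on a target device.
import time
import numpy as np

# Toy "model": a single dense layer over flattened 3x32x32 images.
W = np.random.rand(3 * 32 * 32, 100).astype(np.float32)

def infer(batch: np.ndarray) -> np.ndarray:
    """Stand-in forward pass: flatten each image and multiply by W."""
    return batch.reshape(batch.shape[0], -1) @ W

def streaming_latency_ms(n_images: int = 1000) -> float:
    """Latency metric: one image per call, average wall time per image."""
    image = np.random.rand(1, 3, 32, 32).astype(np.float32)
    start = time.perf_counter()
    for _ in range(n_images):
        infer(image)
    return (time.perf_counter() - start) / n_images * 1e3

def max_throughput_ips(batch_size: int = 256, n_batches: int = 50) -> float:
    """Throughput metric: large batches, total images per second."""
    batch = np.random.rand(batch_size, 3, 32, 32).astype(np.float32)
    start = time.perf_counter()
    for _ in range(n_batches):
        infer(batch)
    return batch_size * n_batches / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"streaming latency: {streaming_latency_ms():.3f} ms/image")
    print(f"max throughput:   {max_throughput_ips():,.0f} images/second")
```

Reporting latency at batch size 1 and throughput at the largest sustainable batch mirrors the distinction drawn above between real-time streaming apps and bulk inference: small batches minimize per-image wait time, while large batches amortize data transfers and keep highly parallel hardware saturated.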

Published in

ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 4
August 2023, 481 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3596215
Editor: Huan Liu

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 15 June 2023
• Online AM: 2 May 2023
• Accepted: 14 April 2023
• Revised: 2 March 2023
• Received: 8 February 2022

