Abstract
As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum application acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, many applications are not well represented by these standards and require different workloads, such as other ML models and datasets, to achieve similar goals. Additionally, many applications, such as real-time video processing, focus on the latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Because these models are composed of fundamental neural network operations yet differ architecturally from each other, they stress devices in distinct and insightful ways, allowing generalizations about the performance of other models to be drawn. Many devices featuring ML-specific hardware and optimizations are analyzed, including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, the ML-oriented hardware added to Intel Xeon CPUs boosts throughput by 3.7× and reduces latency by up to 34.7×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices are limited in throughput due to large data-transfer times and are not competitive in latency. The FPGA frameworks achieve the lowest latency, with the Xilinx Alveo U200 FPGA reaching 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs keep critical model data closer to the processing elements for lower latency. The massively parallel, high-memory GPU devices with Tensor Core accelerators achieve the best throughput: the NVIDIA Tesla A100 GPU reaches 42,513 and 52,484 images/second on AlexNet and GoogLeNet, respectively.
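To make the two benchmark modes concrete, the sketch below times streaming latency (batch size 1, one image at a time) and batched throughput for AlexNet and GoogLeNet on a GPU. This is a minimal illustration, not the harness used in this work: torchvision's stock AlexNet and GoogLeNet stand in for the Caffe-based AlexNet and the custom GoogLeNet studied here, and the batch size of 256 is an assumed value chosen to saturate the device.

```python
# Minimal sketch of the two benchmark modes (not the paper's harness):
# streaming latency at batch size 1 versus throughput at a large batch size.
import time

import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def bench(model, batch, iters=100, warmup=10):
    """Return (latency in ms/iteration, throughput in images/second)."""
    model = model.eval().to(device)
    x = torch.randn(batch, 3, 224, 224, device=device)  # synthetic input images
    with torch.no_grad():
        for _ in range(warmup):               # warm up caches and lazy init
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()          # drain queued kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()          # wait for all timed kernels to finish
        elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    throughput = batch * iters / elapsed
    return latency_ms, throughput


for name, net in [("AlexNet", models.alexnet()), ("GoogLeNet", models.googlenet())]:
    lat, _ = bench(net, batch=1)    # streaming latency: one image per inference
    _, thr = bench(net, batch=256)  # throughput: assumed batch size to saturate device
    print(f"{name}: {lat:.2f} ms/image streaming, {thr:.0f} images/s batched")
```

The same split drives the results above: latency-sensitive applications such as real-time video care about the batch-1 number, while offline workloads care about the saturated-batch number, and the two can rank devices very differently.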