Abstract
As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum application acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, many applications are not well represented by these standards and require different workloads, such as other ML models and datasets, to achieve similar goals. Additionally, many applications, such as real-time video processing, focus on the latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Because these models are composed of fundamental neural network operations yet differ architecturally from each other, they stress devices in distinct and insightful ways, allowing generalizations about the performance of other models to be drawn. Many devices featuring ML-specific hardware and optimizations are analyzed, including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, the ML-oriented hardware added to Intel Xeon CPUs boosts throughput by 3.7× and reduces latency by up to 34.7×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices are limited in throughput due to large data-transfer times and are not competitive in latency. The FPGA frameworks achieve the lowest latency, with the Xilinx Alveo U200 FPGA reaching 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs keep critical model data closer to the processing elements for lower latency. The massively parallel, high-memory GPU devices with Tensor Core accelerators achieve the best throughput: the NVIDIA Tesla A100 GPU reaches 42,513 and 52,484 images/second on AlexNet and GoogLeNet, respectively.
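To make the two benchmark modes concrete, the sketch below times streaming latency (batch size 1, one image at a time) and batched throughput for AlexNet and GoogLeNet on a GPU. This is a minimal illustration, not the harness used in this work: torchvision's stock AlexNet and GoogLeNet stand in for the Caffe-based AlexNet and the custom GoogLeNet studied here, and the batch size of 256 is an assumed value chosen to saturate the device.

```python
# Minimal sketch of the two benchmark modes (not the paper's harness):
# streaming latency at batch size 1 versus throughput at a large batch size.
import time

import torch
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def bench(model, batch, iters=100, warmup=10):
    """Return (latency in ms/iteration, throughput in images/second)."""
    model = model.eval().to(device)
    x = torch.randn(batch, 3, 224, 224, device=device)  # synthetic input images
    with torch.no_grad():
        for _ in range(warmup):               # warm up caches and lazy init
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()          # drain queued kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()          # wait for all timed kernels to finish
        elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    throughput = batch * iters / elapsed
    return latency_ms, throughput


for name, net in [("AlexNet", models.alexnet()), ("GoogLeNet", models.googlenet())]:
    lat, _ = bench(net, batch=1)    # streaming latency: one image per inference
    _, thr = bench(net, batch=256)  # throughput: assumed batch size to saturate device
    print(f"{name}: {lat:.2f} ms/image streaming, {thr:.0f} images/s batched")
```

The same split drives the results above: latency-sensitive applications such as real-time video care about the batch-1 number, while offline workloads care about the saturated-batch number, and the two can rank devices very differently.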