Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms

Neural networks are becoming more and more popular in science and in industry, mostly because new solutions based on neural networks show state-of-the-art results in domains previously dominated by traditional methods, e.g. computer vision and speech recognition. To achieve these results, however, neural networks have become progressively more complex and therefore require much more training; today, training a neural network can take weeks. This problem can be addressed by parallelizing the training and running it on modern clusters and supercomputers, which can significantly reduce the training time. Faster training is essential for data scientists, because it lets them obtain results sooner and make the next decision. In this paper we provide an overview of distributed training in popular modern deep learning frameworks, in terms of both functionality and performance. We consider multiple hardware choices: training on multiple GPUs and on multiple computing nodes.


Introduction
Neural networks are currently one of the most popular methods for creating AI systems. They show state-of-the-art results in many areas, including computer vision [55,59,62,69], natural language processing [53,58] and speech recognition [56,71]. However, to achieve such results, neural networks are becoming more and more complex, and more and more data is needed for their training [53,60,64]. As a result, the training of such neural networks requires considerable time: days, weeks, and even months.
Problems with training neural networks can be solved using modern clusters and supercomputers. In this case, the neural network is trained in parallel on several computing nodes of the cluster, which can significantly reduce the training time [54]. In addition, a parallel computing system makes it possible to train a large network that does not fit in the memory of one computer [52,60]. As a result, distributed training of neural networks is rapidly gaining popularity.
In this paper, we provide an overview of the distributed training capabilities of modern deep learning frameworks. The use of various hardware options for distributed training is considered: distributed training on several GPUs and on several computing nodes. We also describe approaches to organizing distributed training. The most popular approaches are model parallelism and data parallelism. In data parallelism, each device contains a complete copy of the neural network and performs training on a part of the data. In model parallelism, the neural network is divided into separate parts, which are distributed among devices and trained on the complete data set.
There are two ways of organizing distributed training: asynchronous and synchronous. In asynchronous training, parallelization is achieved by dividing the work between workers and parameter servers. Workers perform the training (independently of each other), while parameter servers only store the model parameters and apply updates to them. The synchronous approach is organized as follows. Each worker has its own copy of the model but processes a different part of the data. After the workers have processed their parts of the data, they exchange results with each other and update the parameters of the model at the same time. The two methods can be combined: synchronous training within a node with several GPUs and asynchronous training between nodes.
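The synchronous data-parallel scheme described above can be illustrated with a minimal sketch in plain Python. It is not taken from any of the surveyed frameworks: the toy model (fitting y = w * x), the function names and the learning rate are all assumptions chosen for illustration, and `all_reduce_mean` is only a stand-in for the collective communication that a real system would perform between devices.

```python
# Toy model: fit y = w * x with squared error, using plain Python floats.

def gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one shard of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the collective that averages the workers' results."""
    return sum(values) / len(values)

def synchronous_step(w, shards, lr):
    """One synchronous step: every worker computes a gradient on its own
    shard, the gradients are averaged, and all replicas apply the same update."""
    local_grads = [gradient(w, shard) for shard in shards]  # one per worker
    return w - lr * all_reduce_mean(local_grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]  # two workers with equally sized shards

w = 0.0
for _ in range(50):
    w = synchronous_step(w, shards, lr=0.05)
print(round(w, 4))
```

With equally sized shards, the average of the per-worker gradients equals the gradient over the whole batch, so in this sketch the synchronous scheme follows the same trajectory as single-device training.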
Currently there is a large number of frameworks for training neural networks, the most popular of which are TensorFlow, Caffe, Caffe2, Torch, PyTorch, MXNet, Theano, PaddlePaddle, Microsoft Cognitive Toolkit, Deeplearning4j, Keras and OpenCV. All these frameworks can train in parallel on several computing cores and processors and can train on a GPU, but their support for distributed training is much more limited. In this review, we included only libraries that can parallelize the training of a neural network on several GPUs and computing nodes. Therefore, it was necessary to exclude such popular frameworks as Theano and OpenCV, in which distributed training tools are not developed.
This paper discusses the use of parallel computing systems only for training deep neural networks. Inference requires other optimizations, including specifics of parallel execution and the use of GPUs on mobile systems [70]. Supercomputers are rarely used for inference.
The rest of the paper is organized as follows. Section 1 describes the most common frameworks. The following items are highlighted for each framework:
• A brief description of the development history and current status.
• Implementation features and a list of supported algorithms: we give a description of the basic principles of the framework. The support for the implementation of the main types of neural networks (fully connected, convolutional (CNN), recurrent (RNN), deep autoencoders (DAE), generative adversarial networks (GAN) and networks with the Transformer architecture) is indicated for each framework.
• Optimizations for Intel, Power, Nvidia platforms: describes the availability of optimizations, as well as their application in terms of availability.
• Use of several GPUs for training: the possibility of using several GPUs for training both with the standard package and with the help of add-ons.
• Support for training on multiple nodes: the ability to start training a neural network on multiple machines (for example, on different nodes of a cluster). Both standard support and the use of add-ons are considered.
Section 2 compares the frameworks in terms of both functionality and performance on the base models that the community uses to evaluate the performance of various frameworks. It is worth noting that framework performance is not addressed in Section 1.

Overview
TensorFlow (hereinafter, TF) [47] is a relatively young framework for high-performance computing, which is mainly used for implementing deep learning methods and is developed by the Google Brain team. It was developed in a closed manner until 2015, but after its revamp it was released to the public. The TF core is mostly written in a combination of C++ and CUDA, but there are also parts written in Python. TF has an API for both Python and C++, but since TF is mainly used from Python, this review looks at the Python API.
TF is under active development. The current version is TensorFlow 2.0, which focuses on ease of use, including the ease of distributed training. Detailed development plans can be seen in [45].

Implementation features and a list of supported algorithms
TF relies on the concept of a computation graph that describes data and the operations on it. Because of this mechanism, TF can be used for almost any calculation. Examples of simple graphs are shown in Fig. 1. The computation graph can be created explicitly by the user. A default graph is also created when the framework is used; it is used whenever you do not specify the graph with which an operation is performed. TF implements many algorithms for both deep learning and traditional machine learning.
Inside the graph, the data is represented by so-called tensors (n-dimensional arrays). The size of a tensor can be specified when it is defined, or it can be left unspecified if the size changes dynamically during the execution of the computation graph.
The graph is calculated within a session: all the data necessary for the calculation is passed to the session, the graph is calculated, and the result of the execution is returned. TF 2.0 introduced eager execution, which makes it possible to start calculations without creating a session.
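The separation between building a graph and running it in a session can be illustrated with a small sketch in plain Python. It deliberately does not use the TF API: the `Node`, `placeholder` and `Session` names below are hypothetical stand-ins that only mimic the idea of deferred execution with values fed at run time.

```python
class Node:
    """A graph node: either a placeholder fed at run time or an operation."""
    def __init__(self, op=None, inputs=()):
        self.op = op          # function applied to the input values, or None
        self.inputs = inputs  # nodes whose values this node depends on

    def __add__(self, other):
        return Node(lambda a, b: a + b, (self, other))

    def __mul__(self, other):
        return Node(lambda a, b: a * b, (self, other))

def placeholder():
    """A node without an operation; its value must be supplied when running."""
    return Node()

class Session:
    """Evaluates a node, given values for the placeholders it depends on."""
    def run(self, node, feed):
        if node in feed:
            return feed[node]
        if node.op is None:
            raise ValueError("placeholder was not fed a value")
        args = [self.run(inp, feed) for inp in node.inputs]
        return node.op(*args)

a, b = placeholder(), placeholder()
c = a * b + a                              # only builds the graph
result = Session().run(c, {a: 3.0, b: 4.0})  # the computation happens here
print(result)
```

In this sketch, the expression `a * b + a` only records the operations; nothing is computed until `run` is called with concrete values, which is the behaviour that eager execution removes.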
In TensorFlow you can implement algorithms that efficiently utilize tensors (for example, there are ray tracing implementations on TF). But since all the necessary algorithms for deep learning are provided inside the framework, it is most often used for deep learning. In TF most of the optimizers (gradient descent, Adagrad, AdaDelta, momentum, Adam, etc.) and activation functions (relu, relu6, elu, sigmoid, tanh, softplus, softsign, dropout, softmax etc.) are implemented.
In TensorFlow you can implement all the main types of neural networks: fully connected, convolutional, recurrent, deep autoencoders, GAN and Transformer.

Intel, Power, Nvidia platform optimizations
TF optimizations for Intel are mentioned on the official TF website. For example, TF works best when it is built from the source files provided on the official TF website, because such a build can take advantage of all the features of Intel processors. Intel also provides MKL-DNN, which is supported by TF (TF with Intel MKL-DNN).
TF is available in the PowerAI framework package, which includes Power-optimized versions of the popular frameworks and Distributed Deep Learning (DDL), a communication library optimized for distributed training of neural networks. It is worth noting that PowerAI supports only GPU systems.
It is also worth noting that the Intel and PowerAI versions do not differ in terms of implemented algorithms. This is also true for the subsequent frameworks, unless otherwise specified.
TF has an optimized version for Nvidia GPUs, which can be downloaded from the official website. This version uses the CUDA library, which is used to run general-purpose computing on the GPU, and cuDNN, which contains efficient operations for working with neural networks on the GPU. In addition, Nvidia has a special platform called the Nvidia GPU Cloud, which has a catalog of Docker containers with preinstalled and optimized Nvidia software, including TensorFlow.

Multi-GPU and Multi-node training of neural networks
The TF framework was developed with the goal of providing distributed training in large clusters, so it includes training tools on several GPUs and computing nodes.
Data-parallel training in TF is implemented in the Distribution Strategy API (tf.distribute.Strategy), which includes several strategies for distributed training [14]. The most popular approach is synchronous distributed training on several GPUs, either on one node (MirroredStrategy) or on several nodes, each of which can have several GPUs installed (MultiWorkerMirroredStrategy). Asynchronous training is also possible with ParameterServerStrategy. This strategy allows several nodes to be used as parameter servers and workers, and worker nodes can contain several GPUs. When the Distribution Strategy is used together with the high-level TensorFlow APIs (Keras and Estimators), training is parallelized automatically, both across several GPUs within one node and across cluster nodes. Developing your own layers or training methods requires explicit parallel programming.
Model-parallel training in TF is possible using the Mesh TensorFlow library [66]. This library is an add-on for TF and defines the language of distributed processing of tensors. The library allows you to explicitly indicate by which dimension the tensors will be divided for distributed processing. As a result, it is possible to parallelize operations both by model and by data, which provides data-parallel or model-parallel training, as well as its combination. Training is performed in synchronous mode; all synchronization operations are provided automatically by the Mesh TensorFlow library.
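The effect of choosing the dimension along which tensors are split can be sketched in plain Python. The example does not use the Mesh TensorFlow API; the helper names and the small matrices are assumptions made for illustration. It computes one layer Y = XW twice: once with the batch dimension of X split between two hypothetical devices (data parallelism), and once with the output dimension of W split (model parallelism).

```python
def matmul(x, w):
    """Plain-Python matrix product of x (n x k) and w (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def split_rows(m, parts):
    """Split a matrix into equally sized blocks of rows."""
    size = len(m) // parts
    return [m[i * size:(i + 1) * size] for i in range(parts)]

def split_cols(m, parts):
    """Split a matrix into equally sized blocks of columns."""
    size = len(m[0]) // parts
    return [[row[i * size:(i + 1) * size] for row in m] for i in range(parts)]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # batch = 4, d_in = 2
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]      # d_in = 2, d_out = 4

full = matmul(x, w)

# Split the batch dimension: each device holds all of w and a part of x.
data_parallel = [row for part in split_rows(x, 2) for row in matmul(part, w)]

# Split the output dimension: each device holds all of x and a part of w.
col_blocks = [matmul(x, part) for part in split_cols(w, 2)]
model_parallel = [left + right for left, right in zip(*col_blocks)]

print(data_parallel == full, model_parallel == full)
```

Both splits reproduce the single-device result; they differ only in which tensor is partitioned and in how the partial results are concatenated.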
Horovod [65] is a library separate from TF that runs on top of TF and PyTorch and provides a simpler interface for implementing distributed synchronous training on clusters. The library uses MPI for communication between processes, and it can also use several GPUs on a node. More details about the use of Horovod can be found in [20]. It is worth noting that Horovod can be used with the DDL communication library from IBM; details are presented in [13].

Overview
Caffe (Convolutional Architecture for Fast Feature Embedding) [57] is a deep learning framework developed by the University of California, Berkeley. The main emphasis in this framework was made on the task of image classification and segmentation. Caffe is distributed under the BSD license. It is believed that Caffe is one of the very first successful deep learning frameworks, considering many research papers that have used this framework. Caffe has a library of pre-built and pre-trained models of neural networks that have been successfully used in a particular subject area. This resource is called Model Zoo and it is considered a big advantage of Caffe, since there are many useful models there, which is why other frameworks usually implement Caffe model converters into their own.
The framework is written in C++ with an interface for Python. Caffe is no longer supported, since the development team has switched its interest towards the development of Caffe2 [5]. At the end of March 2018, the PyTorch and Caffe2 teams merged, after which Caffe2 moved to the PyTorch repository and became a part of it. The latest version of Caffe, 1.0, was released on April 18, 2017.

Implementation features and a list of supported algorithms
In Caffe, everything is built on layers. Layer is a description of a data processing operation. It can implement both a neural network layer and pooling (non-linear compression of a feature map, in which a group of pixels is compressed to 1 pixel using a non-linear transformation, such as a maximum function), filters, non-linear transformations, an error function, etc. The neural network (Net) itself will consist of many layers, which must be described in the configuration file in the Protobuf format.
Each layer also has two functions: forward and backward. The forward function computes the output of the layer from its input (for a loss layer, this is the value of the error function); the backward function computes the gradients that are then used to update the model parameters. Each layer implements two versions of these functions, one for the CPU and one for the GPU. Depending on the mode (CPU or GPU) specified at the beginning of the program, the corresponding version is called (CPU is used by default). The forward and backward passes of the neural network itself invoke the corresponding functions of the layers it contains.
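The layer abstraction with its forward and backward functions can be sketched in plain Python as follows. This is not Caffe code: the class names and the `bottom` and `top_diff` argument names are only loosely borrowed from Caffe terminology, the layers are reduced to scalars and short lists, and the separate CPU and GPU variants are omitted.

```python
class InnerProduct:
    """Toy fully connected layer: one weight per input, scalar output."""
    def __init__(self, weights):
        self.weights = list(weights)
        self.grad = [0.0] * len(self.weights)

    def forward(self, bottom):
        self.bottom = bottom  # keep the input for the backward pass
        return sum(w * x for w, x in zip(self.weights, bottom))

    def backward(self, top_diff):
        # Gradient with respect to the weights, then with respect to the input.
        self.grad = [top_diff * x for x in self.bottom]
        return [top_diff * w for w in self.weights]

class EuclideanLoss:
    """Toy loss layer: 0.5 * (prediction - target)^2."""
    def __init__(self, target):
        self.target = target

    def forward(self, bottom):
        self.diff = bottom - self.target
        return 0.5 * self.diff ** 2

    def backward(self, top_diff=1.0):
        return top_diff * self.diff

class Net:
    """Calls forward on the layers in order and backward in reverse order."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, data):
        for layer in self.layers:
            data = layer.forward(data)
        return data

    def backward(self):
        diff = 1.0
        for layer in reversed(self.layers):
            diff = layer.backward(diff)
        return diff

fc = InnerProduct([0.5, -1.0])
net = Net([fc, EuclideanLoss(target=1.0)])
loss = net.forward([2.0, 1.0])
input_grad = net.backward()
print(loss, fc.grad, input_grad)
```

After the backward pass, `fc.grad` holds the gradients that an optimizer would use to update the weights of the layer.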
The final element is the optimizer (Solver), which implements the optimization method of your choice. The following are supported: SGD, AdaDelta, AdaGrad, Adam, Nesterov's Accelerated Gradient and RMSProp.
Initially, Caffe implemented algorithms only for machine vision (based on convolutional neural networks), but later added support for recurrent neural networks in the form of LSTM. Now it is possible to implement fully connected, recurrent, convolutional networks, deep autoencoders and GAN. Transformer network implementation was not found.

Intel, Power, Nvidia platform optimizations
Intel provides Intel Distribution of Caffe, a version optimized for Intel processors [24]. This version of Caffe uses Intel MKL-DNN, which, in its turn, allows you to use OpenMP by default.
Caffe2 worked together with Intel to integrate Intel MKL functions to optimize the performance of the framework when used in production, but at the moment Caffe2 does not have full [...]
Caffe and Caffe2 support CUDA and cuDNN, and also have containers in the Nvidia Cloud.

Multi-GPU training of neural networks
To use several GPUs for training in Caffe, you must use your own Caffe scripts; a complete guide is provided in [4]. Synchronized distributed training is used. Asynchronous distributed training, as well as model parallelism, are not supported.
You can use a special version of Caffe, NVCaffe, which is supported by NVidia. This version was created specifically for the use of several GPUs. User instructions can be found in [35].
Caffe2 has a special distributed training module that implements various algorithms for synchronous distributed training of models (for example, SynchronousSGD). For synchronization, Caffe2 uses NCCL, a communication library for multi-GPU nodes that provides very good scalability. An example of training the ResNet-50 neural network on several GPUs and on several nodes is given on the official website [6]. More general instructions are provided in [7]. Model parallelism is possible in Caffe2, but the layers of the neural network must be distributed across devices manually before the model is trained.
Caffe2 has no official support for asynchronous training.

Multi-node training of neural networks
For Intel Distribution of Caffe, a guide to setting up and starting the training process is provided in [19]. Multi-node mode uses synchronized distributed training. To synchronize after processing their data, the nodes communicate through the Intel Machine Learning Scaling Library (MLSL), which implements communication primitives for both model and data parallelism. However, multi-node training requires the nodes to be configured separately, which is a problem when the user does not know in advance which nodes the application will run on.
For HPC clusters with GPU nodes, Inspur developed a version of Caffe, Caffe MPI [3], which uses MPI to synchronize results between nodes after each node has processed its part of the data.
In Caffe2, similar to multi-GPU training, you can also run model training on multiple nodes. You can select several communication backends to synchronize the results: Gloo and Redis. It is also worth noting that it is necessary for all nodes to have a common shared folder through which, after starting the program, the nodes can detect each other and start the training. More detailed instructions for using this mode are given in the ResNet-50 training example in [6].

Overview
Torch [51] is a MATLAB-like framework for the Lua programming language that provides a large set of algorithms for deep learning. The core of the framework is written in C, but the programs themselves are written in Lua. LuaJIT is used for JIT compilation, which significantly speeds up programs. The framework is distributed under the BSD license. Although Torch is a very powerful and convenient framework, programs for it must be written in Lua, and, as noted in [31], Lua is not a popular programming language in the field of data science. Users work around this problem by using the PyTorch package, which has a Python interface and not only includes all the functionality of Torch, but also extends it.
Torch is not in active development, as the team switched to PyTorch development. The latest version, torch7, was released on February 27, 2017.

Implementation features and a list of supported algorithms
The core of the framework is the torch package, which contains a tensor implementation (n-dimensional arrays, similar to TensorFlow) with basic indexing operations, array slicing, transposition, etc., as well as BLAS operations, implementations of mathematical functions (max, min, etc.) and statistical distributions.
The next important package is nn, which is used to build neural networks. The package has many different parts, but all of them share a common base called "Module", which implements the forward and backward functions that perform forward and backward passes through the network. Modules can be connected using classes such as Sequential, Parallel and Concat, which allows you to build complex models of neural networks. There is also a list of basic modules, such as Tanh, Linear, Max, etc.
Error functions are implemented as subclasses of the Criterion class, which is similar to the Module class, since it has the same forward and backward functions. Torch includes implementations of the basic error functions, such as cross-entropy and mean squared error, as well as stochastic gradient descent for optimization.
Torch allows you to implement all the most popular types of neural networks: fully connected, convolutional, recurrent, deep autoencoders, GAN and Transformer.

Intel, Power, Nvidia platform optimizations
Torch has a separate development branch that has Intel optimizations, especially for Intel Xeon. You can use Intel MKL in conjunction with Torch. The repository branch is located in [25].
Torch is included in the PowerAI framework.
Optimizations for the Nvidia platform are included in cutorch, the CUDA backend for Torch. Torch supports CUDA and cuDNN. Nvidia Cloud does not contain a container with an optimized version of Torch.

Multi-GPU training of neural networks
Support for multiple GPUs is included in the standard version of Torch. To utilize multi-GPU training, you need to use the DataParallelTable module from nn, which implements data parallelism. Torch supports synchronous distributed training. It should be noted that parallelization is quite simple: you just need to wrap the Module that implements the neural network in a DataParallelTable with the list of GPUs on which the training will be conducted. A more detailed guide can be found in [9].
There is no official support for asynchronous training and model parallelism with Torch.

Multi-node training of neural networks
Torch does not have official support for distributed training on multiple nodes, but there is a separate development branch, Torch MPI, which allows you to use nodes not only exclusively with the CPU, but also with the GPU. More detailed information can be found in [46].

Overview
PyTorch [42] is a deep learning framework that is based on Torch and is developed primarily by Facebook's artificial intelligence research group. PyTorch also includes the Caffe2 framework, so PyTorch has advanced distributed training capabilities.
PyTorch is currently under active development. The stable version, 1.3.0, was released on October 10, 2019.

Implementation features and a list of supported algorithms
Because PyTorch is based on Torch, the two are very similar. The main module, torch, includes tensors, operations on them (transformations, mathematical functions, etc.) and serialization operations.
The next important module, torch.nn, is used to build neural networks. This module includes the base class Module, which is inherited by all types of neural network layers implemented in PyTorch (convolutional, pooling, linear, recurrent, etc.). The module has implementations of a large number of activation functions (for example, ReLU6, sigmoid, softsign, tanh, etc.) and error functions (MSE, L1Loss, CrossEntropy, BCELoss, etc.). As in Torch, each module has forward and backward functions. The neural network itself is a Module object that combines other Module objects, which makes it possible to perform forward and backward passes through the resulting network automatically.
With PyTorch, you can implement almost all the types of architecture used in neural networks: fully connected, convolutional, recurrent, deep autoencoders, GAN, and Transformer.

Intel, Power, Nvidia platform optimizations
Intel optimizations mainly consist of integrating Intel MKL-DNN into PyTorch [22]. Prior to version 1.0, the PyTorch distribution did not utilize Intel MKL-DNN, but now PyTorch supports it by default.
IBM has included PyTorch in PowerAI, which uses OpenBLAS. For Nvidia, PyTorch is supported on Nvidia Cloud. It is also worth noting that a collaboration between PyTorch and Nvidia resulted in Apex (A PyTorch Extension), a set of utilities that make it easy to use distributed training technologies and Automatic Mixed Precision. PyTorch supports CUDA and cuDNN.

Multi-GPU training of neural networks
Just like Torch, PyTorch has a DataParallel module in the nn package, and if you wrap a neural network model in this module, then the training will be done on several GPUs. It uses data parallelism and synchronous distributed training. Model parallelism is also possible: when defining a model, it is necessary to distribute the layers across different devices and implement its forward and backward pass function, including data transfers between GPUs [43].
PyTorch has no official support for asynchronous training.

Multi-node training of neural networks
PyTorch has a DistributedDataParallel module that uses the torch.distributed package to parallelize training across multiple processes/nodes. This module can be used in two modes: one process with several GPUs, or several processes with one GPU each. The developers do not recommend the mode with several GPUs per process, noting that the framework works faster if you create one process for each GPU. The module supports two backends for communication: gloo and nccl. The developers recommend using the nccl backend, as it showed better performance in their experiments.
In version 1.0, the developers changed the core of the torch.distributed module, which now depends on the C10D library; this library works asynchronously for all supported backends. This sped up all of the dependent modules (DataParallel, DistributedDataParallel).
It is also worth noting that Horovod supports PyTorch. Instructions can be found in [20].

Overview
MXNet [1] is a popular deep learning framework known for its flexibility and ability to scale across multiple GPUs and nodes. It is developed by a team from the Apache Software Foundation and distributed under the Apache 2.0 license. The core is written mainly in C++; there are interfaces for a large number of programming languages: C++, Python, R, MATLAB, JavaScript, Go, Scala, Perl, Julia, Wolfram Language. Due to its scalability, Amazon has chosen MXNet for its AWS cloud environment. Like Caffe, MXNet has its own library of pre-trained models, the Gluon model zoo.
MXNet is currently under active development. The latest version, 1.5.0, was released on June 8, 2019.

Implementation features and a list of supported algorithms
MXNet is similar to TensorFlow: it operates on NDArray (an n-dimensional array, similar to a tensor) and a computation graph. Graphs include variables, which are objects whose type and size are not determined at initialization but are inferred as data is fed into the graph.
A neural network is built using the Module API, which implements almost all common types of layers of neural networks, activation functions and optimizers.
MXNet has a higher-level and simpler interface for creating neural networks, Gluon. Gluon is very similar to the Keras API in the simplicity of neural network creation. An initial guide to using Gluon can be found in [2].
MXNet supports most popular neural network architectures such as fully connected, convolutional, recurrent, deep autoencoder, GAN, and Transformer (using the GluonNLP extension).

Intel, Power, Nvidia platform optimizations
Intel and Apache MXNet released version 1.2.0, in which the main point was to optimize MXNet for CPUs using Intel MKL-DNN, which significantly accelerated the work of the framework. Detailed information about both the installation and how MXNet has accelerated can be found in [17].
The authors could not find information about optimizing the framework for the Power architecture. MXNet is not part of PowerAI, but installing MXNet is still possible.
For Nvidia, MXNet is supported by Nvidia Cloud. MXNet also has support for CUDA and cuDNN.

Multi-GPU training of neural networks
By default, data parallelism is used in MXNet, but model parallelism is also supported. An example of using several GPUs to parallelize an LSTM model is provided in [34].
Data parallelism is quite simple. When initializing the module, it is necessary to give a list of GPUs on which training will be conducted. There is also built-in support for static load balancing: if one GPU is faster than another, then you can set the proportions in which the GPUs will process the data. Detailed information on GPU parallelization can be found in [32,33].
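The static load balancing described above amounts to dividing each batch between devices in fixed proportions. A minimal sketch in plain Python is shown below; the function name `split_batch`, its arguments and the rule for distributing the rounding remainder are assumptions made for illustration and are not taken from the MXNet implementation.

```python
def split_batch(batch_size, work_load):
    """Split batch_size samples across devices in proportion to work_load.

    work_load is a list of positive integer weights, one per device.
    Returns a list of (start, end) index ranges, one per device.
    """
    total = sum(work_load)
    sizes = [batch_size * w // total for w in work_load]
    # Hand out the samples lost to integer rounding, one per device in order.
    for i in range(batch_size - sum(sizes)):
        sizes[i % len(sizes)] += 1
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges

# The first device is assumed to be twice as fast as the other two.
slices = split_batch(128, [2, 1, 1])
print(slices)
```

Each device would then process only the samples in its own index range, so that a faster device receives a proportionally larger part of every batch.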

Multi-node training of neural networks
MXNet officially supports both asynchronous and synchronous distributed training. For distributed training, MXNet uses three kinds of processes. The first of these is a worker in which training will occur, and which can use multi-GPU training. The second process is called the server (similar to the parameter server in Tensorflow), which stores the model parameters. There can be several servers, and they can be located both on the machine with the worker or on another machine. Servers store parameters in a key-value format. The third type of process is the scheduler, which initializes the cluster and is responsible for ensuring that other processes can interact with each other.
How training proceeds depends on the mode with which the server is created. There are 4 modes:
• dist_sync: synchronous training.
• dist_async: asynchronous training.
• dist_sync_device: similar to dist_sync, but intended for training on several GPUs; it skips time-consuming communication between the CPU and GPU and synchronizes the results only between GPUs (this mode uses more GPU memory).
• dist_async_device: similar to dist_async, but intended for training on several GPUs.
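The difference between the synchronous and asynchronous server modes can be sketched with a toy key-value store in plain Python. The sketch runs in a single process and is not MXNet code: the class name, the `sync` flag and the plain SGD update rule are assumptions for illustration, and the scheduler process is omitted entirely.

```python
class ToyKVStore:
    """In-process stand-in for a parameter server with key-value storage."""
    def __init__(self, num_workers, lr, sync):
        self.num_workers = num_workers
        self.lr = lr
        self.sync = sync
        self.params = {}
        self.pending = {}

    def init(self, key, value):
        self.params[key] = value
        self.pending[key] = []

    def push(self, key, grad):
        if not self.sync:
            # Asynchronous mode: apply every gradient as soon as it arrives.
            self.params[key] -= self.lr * grad
            return
        # Synchronous mode: wait until every worker has pushed, then apply
        # the average of the collected gradients once.
        self.pending[key].append(grad)
        if len(self.pending[key]) == self.num_workers:
            mean = sum(self.pending[key]) / self.num_workers
            self.params[key] -= self.lr * mean
            self.pending[key] = []

    def pull(self, key):
        return self.params[key]

sync_store = ToyKVStore(num_workers=2, lr=0.1, sync=True)
sync_store.init("w", 1.0)
sync_store.push("w", 2.0)             # first worker
after_first = sync_store.pull("w")    # unchanged: still waiting for worker 2
sync_store.push("w", 4.0)             # second worker
after_second = sync_store.pull("w")   # updated with the averaged gradient

async_store = ToyKVStore(num_workers=2, lr=0.1, sync=False)
async_store.init("w", 1.0)
async_store.push("w", 2.0)
after_async = async_store.pull("w")   # updated immediately
print(after_first, after_second, after_async)
```

In the synchronous case a worker that pulls early still sees the old parameters, whereas in the asynchronous case every push changes the parameters that other workers will pull next.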
If the communication becomes a bottleneck, you can use compression of the gradients to reduce the load on the communication.
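One possible form of gradient compression is threshold quantization with a residual, sketched below in plain Python. The source does not state which scheme MXNet uses, so the scheme, the function name `compress` and the threshold value are assumptions chosen only to illustrate how the volume of communicated data can be reduced.

```python
def compress(grad, residual, threshold):
    """Quantize each gradient entry to -threshold, 0 or +threshold.

    The part of the gradient that is not sent is kept in `residual`
    (modified in place) and added back before the next quantization.
    """
    sent = []
    for i, g in enumerate(grad):
        value = g + residual[i]
        if value >= threshold:
            q = threshold
        elif value <= -threshold:
            q = -threshold
        else:
            q = 0.0
        residual[i] = value - q  # remember what was not transmitted
        sent.append(q)
    return sent

residual = [0.0, 0.0, 0.0]
first = compress([0.3, -0.7, 0.1], residual, threshold=0.5)
second = compress([0.3, -0.7, 0.1], residual, threshold=0.5)
print(first, second)
```

Because every transmitted entry takes one of three values, it can be encoded in two bits; the residual ensures that small gradient components are delayed rather than lost, as the first entry shows when it is sent on the second call.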
It is also worth noting that MXNet added integration with Horovod for distributed training in version 1.4.0.

Implementation features and a list of supported algorithms
The main concept in PaddlePaddle, as in TensorFlow, is a computation graph, but it works differently: in PaddlePaddle, a Python program that describes a neural network model builds a computation graph in protobuf format [41] and then sends it for execution to the so-called Executor: either to a process responsible for distributed training, or to the libpaddle.so library for local execution. A description of how the architecture works and the reasons for this design can be found in [45].
PaddlePaddle supports most of the neural network architectures used: convolutional, recurrent, fully connected, deep autoencoders, GAN, and Transformer.

Intel, Power, Nvidia platform optimizations
PaddlePaddle, like other frameworks, can use Intel MKL to speed up the framework on the CPU and Intel MKL-DNN to speed up the convolutional neural networks.
Information about optimizing the library for the IBM Power architecture could not be found. PaddlePaddle is supported on Nvidia Cloud and can also utilize CUDA and cuDNN.

Multi-GPU training of neural networks
PaddlePaddle has official support for multi-GPU training and offers two ways to use it. The first way is to use ParallelExecutor instead of the usual Executor. PaddlePaddle uses synchronous distributed training; information on an implementation of asynchronous training could not be found in official sources. More detailed instructions can be found in [39]. The developers state that the current version of Fluid provides only the data-parallel training mode [38].
The second way is to compile the computation graph using CompiledProgram, which transforms the graph for faster execution. Then you need to call with_data_parallel, which transforms the graph so that several devices (in CPU mode, threads) can be used. PaddlePaddle automatically detects all available GPU devices and distributes the work between them. The developers recommend using this method. An example of usage can be found in [40].

Multi-node training of neural networks
PaddlePaddle has official multi-node support, and there are two possible uses. With the RPC communication backend, training is based on the Trainer-ParameterServer architecture. Both synchronous and asynchronous distributed training are supported. The Trainer performs the training, and the ParameterServer is responsible for storing and modifying the model parameters. A peculiarity is that the parameter server processes and the training processes must be started separately, and the address and port of the parameter server must be specified for the training processes. DistributeTranspiler is used to distribute training: depending on the number of trainers and parameter servers, it gives individual processes the computation graph that they need. The output computation graphs also contain all the routines for synchronization and parameter updates.

Implementation features and a list of supported algorithms
The CNTK API for building neural networks is similar to the Keras API. CNTK provides basic modules that implement the most commonly used units of neural networks.
The cntk.layers module includes various types of neural network layers: recurrent, Embedding, Dense, etc. Basic models, which consist of a sequence of layers of different or identical types, can be built using Sequential. The cntk.learners module provides a set of the most popular optimizers used in practice, such as SGD, Adam, Nesterov, RMSProp, etc. The cntk.losses module implements error functions such as binary cross-entropy, squared error, etc.
After constructing the neural network model from the provided blocks, it is necessary to create a Trainer object, which receives the model and the optimizer selected for training it. Then you can start training the model using train_minibatch.
CNTK supports such types of neural network architectures as fully connected, convolutional, recurrent, deep autoencoders, GAN and Transformer.

Intel, Power, Nvidia platform optimizations
Microsoft CNTK has two versions: a CPU-only version, which uses Intel MKL-DNN by default, and a GPU version, which uses CUDA (CUB and cuDNN).
The only mention of CNTK on the Power platform that the authors found is in the Keras documentation, which states that Keras supports the CNTK backend.
CNTK is supported in the Nvidia Cloud.

Multi-GPU training of neural networks
CNTK has distributed training support. Synchronous (DataParallelSGD, BlockMomentumSGD, ModelAveragingSGD) and asynchronous (DataParallelASGD) optimizers can be used. A prerequisite for distributed training is that MPI must be installed for communication between processes. CNTK does not support other communication backends (Caffe2, by contrast, can use Gloo). It is worth noting that multi-GPU training on a single node also goes through MPI: each process uses one GPU, and the processes communicate through MPI.
Parallelization is quite simple: after defining the selected optimizer, you need to wrap it in data_parallel_distributed_learner. A detailed guide to using the distributed CNTK package can be found in [30].
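The scheme with one process per GPU and synchronous gradient exchange can be sketched as follows. This is a toy in-process simulation, not CNTK code: allreduce_mean stands in for the collective operation that MPI would perform between processes, and all function names and the one-parameter model are assumptions made for illustration.

```python
# Toy simulation of synchronous data-parallel SGD with one worker per GPU.
# The allreduce is simulated in-process; all names are hypothetical.

def allreduce_mean(values):
    """Every worker receives the mean of the values held by all workers."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def local_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x on a local batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_sgd(batches, steps=100, lr=0.05):
    replicas = [0.0] * len(batches)            # one model replica per worker
    for _ in range(steps):
        grads = [local_gradient(w, b) for w, b in zip(replicas, batches)]
        grads = allreduce_mean(grads)          # synchronization point
        replicas = [w - lr * g for w, g in zip(replicas, grads)]
    return replicas


# Four workers, each holding its own part of data generated from y = 2x.
batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
replicas = data_parallel_sgd(batches)
```

Because every worker applies the same averaged gradient, all replicas remain identical after each step. This is why the same code path can serve several GPUs on one node and several nodes alike.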
CNTK does not support model parallelism.

Multi-node training of neural networks
Multi-node training is identical to multi-GPU training, since MPI is used for communication between processes both across several GPUs and across several nodes.

Overview
Deeplearning4j [10] (hereinafter, DL4J) is a deep learning framework written in Java. It has interfaces for Java, Scala, Python (via Keras), and Clojure. It is the first large-scale deep learning library for Java, and it provides a convenient and scalable interface for distributed training when used with its default integration with Hadoop and Spark. For GPU computing, DL4J requires the additional ND4J library, which provides CUDA support. DL4J is distributed under the Apache 2.0 license.
The development of the framework is quite slow. The latest stable version, 0.9.1, was released on August 12, 2017, and there is also a beta version 1.0.0-BETA4.

Implementation features and a list of supported algorithms
There are two ways to work with the framework. The first, which most people use when they start working with the framework, is MultiLayerNetwork. This is a high-level API for building neural networks that consist of a sequence of layers of a certain type. The syntax of this API is very similar to that of the Keras API.
The second way is the manual construction of a computation graph that describes the architecture of the neural network. It is worth noting that everything that can be done through MultiLayerNetwork can also be done by constructing a computation graph, but the configuration of this graph will be much more complicated. However, this approach allows you to implement any desired network architecture.
For data processing, the DataVec module is included in the framework, which implements almost all the necessary functions for loading, saving and converting data, and developers recommend its use wherever possible.
DL4J uses an additional ND4J library to work with tensors, which provides the ability to work with n-dimensional arrays, as well as the ability to use not only CPUs for processing, but also GPUs.
The framework supports most of the popular types of neural network architectures: convolutional, recurrent, fully connected, deep autoencoders. DL4J provides no native way to create a GAN, but you can import a GAN model described in Keras. A Transformer implementation in DL4J could not be found.

Intel, Power, Nvidia platform optimizations
Like other frameworks, DL4J can utilize Intel MKL BLAS. DL4J was optimized for Power platforms, as described in [11], but this information is outdated: the page with the optimized version of the library, to which that article links, no longer exists.
Nvidia GPUs can be used in DL4J via the ND4J library, which supports the CUDA and cuDNN libraries. DL4J is not supported in Nvidia Cloud.

Multi-GPU training of neural networks
DL4J supports multi-GPU training, and the parallelization process is quite simple and transparent for the developer: after defining the model using, for example, MultiLayerNetwork, you need to pass it to the ParallelWrapper and start the training process. This wrapper implements synchronous distributed training; synchronization of model parameters is implemented in the wrapper. A detailed guide can be found in [16]. It is claimed that DL4J supports model parallelism, but the authors could not find any guides.

Multi-node training of neural networks
Multi-node training, as in PaddlePaddle and MXNet, follows the parameter server/worker architecture: each worker processes its part of the data and sends the results to the parameter server, which updates the parameters and then sends the updated model parameters to all workers. All usage examples provided on the official website are designed to work on Spark clusters [12,68].

Overview
Keras [50] is an open-source library for working with neural networks, written in Python. The library was developed as part of the research efforts of the ONEIROS project and is a front-end to other deep learning frameworks. The basic principle followed by the developers is to make the interface between the developer and the backend as intuitive and convenient as possible for quick development. Keras supports the following frameworks as a backend: TensorFlow, Theano, CNTK, MXNet.
Keras is under active development. The latest version, 2.2.4, was released on October 3, 2018. In addition, Keras has been included in TensorFlow and has been the recommended high-level API since TF 2.0. The TensorFlow version of Keras includes a large number of optimizations for TensorFlow that are not found in the standalone version of Keras.

Implementation features and a list of supported algorithms
There are two ways to build neural network models in Keras: using the Sequential class or the functional API.
Sequential is a tool for building neural networks in which layers go sequentially one after another. Keras implements a large number of layers of neural networks, such as LSTM, layers for convolutional neural networks (Conv1D, Conv2D, Conv3D), etc. After building the model, you need to compile it: call the compile method with the specified optimizer and error function. Keras provides an extensive set of implemented optimizers, such as SGD, Adam, Nadam, Adagrad, RMSProp, etc.
Using the functional API makes it possible to implement more complex types of neural networks in which layers can connect to each other arbitrarily, including several layers that work in parallel. One example of using the functional API is to build a neural network with multiple inputs. This neural network receives part of the input data at the input of the first layer, and the next part of the input data is supplied to the neural network only after merging with the output of one of the internal hidden layers. A more detailed guide to the functional API and examples of its use can be found in [18].
Keras supports most of the popular neural network architectures: convolutional, recurrent, fully connected, deep autoencoders, GAN and Transformer.

Intel, Power, Nvidia platform optimizations
Because Keras is an interface between the developer and the backend, all the optimizations applied to the framework chosen as the backend also apply to Keras.
In Nvidia Cloud, Keras is available as part of the TensorFlow container.

Multi-GPU training of neural networks
Keras officially supports multi-GPU training, but only with the TensorFlow backend using the Distribution Strategy API.
It is also possible to use distributed training with the MXNet backend. To do so, you must use the Keras version supported by the MXNet developers [26]. When compiling the model, it is necessary to pass the context parameter to the compile method; this parameter contains the list of GPUs on which training can be conducted. A more detailed usage guide is provided in [44].
There is no support for model parallelism in Keras.

Multi-node training of neural networks
In Keras with the TensorFlow backend, multi-node training support is available in an experimental mode with MultiWorkerMirroredStrategy. Alternatively, you can use the Horovod framework, which has good support for Keras models.
Table 1 provides a comparison of the frameworks by common parameters. The notation used in the table is as follows: sync / async denote support for synchronous / asynchronous distributed training; "?" in the columns on platform optimizations means that the authors could not find information on the corresponding item; "+" in the columns on the maximum number of GPUs / nodes on which training was run means that the mode is supported, but the authors could not find quantitative results.

Basic Comparison
In terms of the supported types of neural networks, the frameworks are very close. Almost all frameworks, excluding only Caffe and DeepLearning4J, support the popular neural network architectures: fully connected, recurrent, deep autoencoders, GAN, and Transformer. Caffe and DeepLearning4J do not support architectures that have become popular relatively recently: GAN and Transformer. Most of the frameworks are implemented in C++ for high performance but provide APIs in many languages. The most popular API is for Python, which allows you to quickly develop prototypes of neural networks. An exception is the Torch framework, which is written in C and Lua and provides APIs in the same languages. However, Torch is now inferior in popularity to PyTorch, which provides a Python API. Another exception is the DeepLearning4J framework, which is written in Java and provides an API for languages that use the JVM. Thanks to this, DeepLearning4J integrates well in distributed computing systems from the JVM ecosystem: Hadoop and Spark.
All frameworks support synchronous distributed training, while TensorFlow, Torch, MXNet, CNTK and Keras additionally support asynchronous training. Also, all frameworks provide distributed training using data parallelism, and TensorFlow, Caffe2, MXNet and PyTorch additionally support model parallelism. Thus, TensorFlow and MXNet have the advantage in the number of distributed training modes.

Scalability Comparison
In a review of frameworks for distributed deep learning [63], the authors looked into the simplicity of parallelization, as well as performance on both several GPUs and several nodes. For testing, the authors used the AWS P2.xlarge cloud platform, where each instance was equipped with one NVIDIA Tesla K80 and four 2.7 GHz Intel processors. The nodes communicate over 10 Gbps Ethernet, which can greatly affect performance, but the authors argue that if all the training data is downloaded to the nodes beforehand, this network is sufficient to transfer the model parameters. The authors selected CNTK, TensorFlow, Caffe2, MXNet, and Chainer (not covered in this review). OpenMPI 3.0.0 was also used. All experiments were conducted with the ResNet50 neural network; Cifar-10 and ImageNet were used as data for training and testing.
As mentioned earlier, the article also assessed how simple each framework makes distributed training. The authors concluded that TensorFlow has the most complex architecture and parallelization methods for the user, which is why TensorFlow is excluded from some tests. Only one scenario, inference, uses the CPU; all other scenarios use the GPU.
As you can see in the figures above, Chainer has the lowest performance in every test. On the CPU, CNTK is clearly ahead of the competition, which shows how well this framework is optimized for CPU execution. On the GPU, all frameworks except Chainer show good performance; CNTK stands out in classification and MXNet in training, where these frameworks showed the best results.
Figure 4 shows the performance of distributed multi-GPU training. In the version of TensorFlow used by the authors, the mechanism for updating parameters after synchronization had to be implemented manually, which is why the authors excluded the framework from this test. As you can see, CNTK and MXNet clearly stand out, showing near-linear speedup and the best performance. Caffe2 also scales well, but it lags behind because its training speed on one GPU is lower than that of the competing frameworks.
Figure 5 shows the results of multi-node training performance. The authors tested scalability only up to 8 nodes (1 GPU per node), but this still gives an idea of which framework scales better. TensorFlow was excluded from the tests with synchronous training due to the lack of a ready-made solution for this use case. As you can see from the graphs, MXNet is the clear leader: it is far ahead of the other frameworks in all tests and shows scalability close to linear. In terms of scalability, MXNet generally shows the best results.
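Statements such as "near-linear speedup" can be made quantitative by computing the speedup and the scaling efficiency from the measured throughput. The sketch below shows the calculation; the throughput values are hypothetical and are not taken from [63].

```python
# Speedup and scaling efficiency from throughput measurements.
# The numbers below are hypothetical and serve only as an illustration.

def speedup(throughput):
    """Throughput on n devices divided by throughput on one device."""
    base = throughput[1]
    return {n: t / base for n, t in throughput.items()}


def efficiency(throughput):
    """Speedup divided by the number of devices; 1.0 means linear scaling."""
    return {n: s / n for n, s in speedup(throughput).items()}


# Hypothetical throughput in images per second for 1, 2, 4 and 8 GPUs.
measured = {1: 100.0, 2: 190.0, 4: 360.0, 8: 640.0}

for n, eff in efficiency(measured).items():
    print(f"{n} GPU(s): speedup {speedup(measured)[n]:.2f}, efficiency {eff:.2f}")
```

An efficiency that stays close to 1.0 as devices are added corresponds to near-linear scaling. A framework with high efficiency can still be slower in absolute terms if its single-device throughput is low, as noted for Caffe2 above.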
However, the authors of [63] also checked the quality of the models, since a faster framework does not necessarily produce a better model. The training speed of MXNet is more than 1.5 times higher than that of TensorFlow; however, in 30 epochs MXNet cannot reach the test-set accuracy that TensorFlow reaches in 5, which indicates very fast convergence in TensorFlow. Caffe2 also stands out: its convergence is good (second only to TensorFlow), while its training speed is very close to that of MXNet (around 1.5 times faster than TensorFlow).
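The trade-off between speed per epoch and convergence can be made explicit by comparing the time needed to reach a target accuracy. The following sketch uses the epoch counts quoted above and assumes, for simplicity, that MXNet is exactly 1.5 times faster per epoch; the normalized time unit and the function name are assumptions introduced for illustration.

```python
# Time to reach a target accuracy = epochs needed * time per epoch.
# Assumption: MXNet is exactly 1.5x faster per epoch than TensorFlow.

def time_to_accuracy(epochs_needed, time_per_epoch):
    return epochs_needed * time_per_epoch


mxnet_epoch = 1.0                  # normalized time of one MXNet epoch
tf_epoch = 1.5 * mxnet_epoch       # TensorFlow is 1.5x slower per epoch

tf_time = time_to_accuracy(5, tf_epoch)                 # TensorFlow reaches the target in 5 epochs
mxnet_lower_bound = time_to_accuracy(30, mxnet_epoch)   # MXNet has not reached it after 30 epochs

ratio = mxnet_lower_bound / tf_time
print(f"TensorFlow: {tf_time} units, MXNet: more than {mxnet_lower_bound} units")
print(f"MXNet needs more than {ratio}x the wall-clock time of TensorFlow")
```

Under this assumption, TensorFlow reaches the target accuracy in a quarter of the time that MXNet has already spent without reaching it, despite being slower per epoch.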
A similar review was made in [67], where several neural network architectures were selected and CNTK, MXNet, Caffe, Torch, and TF were tested. This article did not use multi-node training, only multi-GPU training (at most 4 GPUs). The processing speed of one batch on several GK210 GPUs is shown in Tab. 2. Cifar-10 was used as data for AlexNet-R and ResNet-56, and MNIST for FCN-R. According to the results, TF scales very poorly on this platform. Torch and CNTK are the best in scalability, but despite this, Torch loses in speed to both CNTK and MXNet.
In terms of the rate of convergence, the authors concluded that Torch and CNTK cope well with FCN-R, while MXNet and Torch show the best performance on AlexNet-R and ResNet-56.
In [49], the authors explored PowerAI DDL and reported the results of the IBM-Caffe and Torch versions provided in PowerAI on ResNet-50 (ImageNet with 1K classes was used). The results are shown in Tab. 3 and 4. These frameworks from PowerAI show very good scalability on IBM Power platforms, and they were tested on a large number of nodes, which is very rare.
The scalability of TF with Horovod for distributed training is shown in [28] on two neural network architectures: Inception V3 and ResNet-101. The results are presented in Fig. 6. Details of what data and GPUs were used were not indicated. It is worth noting that Horovod not only provides more impressive results, but also has a more convenient and simple parallelization interface than the default TensorFlow.
The developers of PaddlePaddle benchmarked their framework against PyTorch, MXNet and TensorFlow [37]. They focused on single-node multi-GPU distributed training and used models such as SE-ResNeXt50, Mask-RCNN and DeepLab V3+ for evaluation. It is worth noting that in every test PaddlePaddle is compared with only one of the competitors, e.g., PyTorch for SE-ResNeXt50 and TensorFlow for DeepLab V3+. Table 5 shows the results of the experiments. In these experiments PaddlePaddle is superior to all of the other frameworks in terms of training speed, but it is hard to evaluate the quality of the trained models, because the developers did not consider the rate of convergence of the models.
In the case of PyTorch, the developers provided scalability data in their talk at GTC 2019 (Fig. 7 and 8). As you can see, PyTorch shows very good scalability as the number of nodes increases. The developers also noted that switching to the c10d backend accelerates training by 19%.

Conclusions
Distributed training of neural networks is becoming increasingly popular. It not only allows you to reduce the training time of a neural network, but also makes it possible to train large neural networks that do not fit into the memory of one machine. However, not all deep learning frameworks manage to develop distributed training quickly enough: while multi-GPU and multithreaded training is present in all popular frameworks, the situation with distributed multi-node training is much worse. There is good support for distributed training in the frameworks that were created with distributed training in mind: TensorFlow, MXNet and PaddlePaddle, as well as PyTorch and DeepLearning4J (due to integration with the Spark infrastructure).
The most popular method of distributed training is synchronous training with data parallelism. Tools for model parallelism have only recently begun to develop actively: there is the Mesh TensorFlow solution, and PyTorch provides the ability to implement model parallelism manually. Effective training on large supercomputers with a large number of nodes is impossible without model parallelism. We can expect that active research will be carried out and that new frameworks will appear in this area in the near future.
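The difference from data parallelism can be shown with a minimal sketch of model parallelism, in which the weights of one layer, rather than the data, are split between devices. The example is a pure-Python simulation with hypothetical function names; it does not use Mesh TensorFlow or PyTorch.

```python
# Toy model parallelism: the rows of a weight matrix are split between
# devices, so no device has to store the whole layer. Names are hypothetical.

def matvec(rows, x):
    """Multiply a matrix, given as a list of rows, by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def model_parallel_matvec(weight_rows, x, num_devices):
    chunk = (len(weight_rows) + num_devices - 1) // num_devices
    # Each shard would live in the memory of a different device.
    shards = [weight_rows[i:i + chunk] for i in range(0, len(weight_rows), chunk)]
    # Every device receives the full input and computes its part of the output.
    partial = [matvec(shard, x) for shard in shards]
    # Gather the partial outputs in device order.
    return [y for part in partial for y in part]


W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
y = model_parallel_matvec(W, x, num_devices=2)
```

The result is the same as that of the unsplit layer, but each device stores only half of the weights, which is what allows a model larger than the memory of one device to be trained.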
A large role is played not only by the performance and technical capabilities of the framework, but also by the ease of use of distributed training. This is because finding a neural network that provides the necessary quality for a given problem requires training a large number of neural networks with different architectures and hyperparameters. Therefore, third-party libraries such as Horovod, which make distributed training with the TensorFlow, PyTorch, and MXNet frameworks easy and convenient, are gaining popularity. The TensorFlow framework is moving in the same direction: in version 2.0, Keras became the recommended high-level API, which integrates all distributed training tools.
You can also speed up the search for the most suitable neural network architectures using tools for automated distributed hyperparameter optimization, such as HyperOpt [48], Tune [61], Keras Tuner [27], etc. Due to the growing popularity of distributed training of neural networks, one can also expect further development of tools for distributed optimization of hyperparameters.
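The idea behind such tools can be sketched as a parallel random search, in which independent trials are evaluated concurrently. The sketch uses only the Python standard library and does not reproduce the API of HyperOpt, Tune or Keras Tuner; validation_loss is a hypothetical stand-in for training a network and measuring its validation loss, and the search space is chosen arbitrarily.

```python
# Parallel random search over hyperparameters using the standard library.
# validation_loss is a hypothetical stand-in for a real training run.
import random
from concurrent.futures import ThreadPoolExecutor


def validation_loss(config):
    """Pretend to train a network and return its validation loss."""
    return (config["lr"] - 0.01) ** 2 * 1e4 + (config["units"] - 64) ** 2 / 1e3


def sample_config(rng):
    """Draw a learning rate on a log scale and a layer width from a fixed set."""
    return {"lr": 10 ** rng.uniform(-4, -1), "units": rng.choice([16, 32, 64, 128])}


def random_search(num_trials=32, workers=4, seed=0):
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(num_trials)]
    # Trials do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(validation_loss, configs))
    best = min(range(num_trials), key=losses.__getitem__)
    return configs[best], losses[best]


best_config, best_loss = random_search()
```

In a real setting each trial would be a full (possibly itself distributed) training run placed on a separate GPU or node, and the thread pool would be replaced by a cluster scheduler.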
The consolidation of frameworks for training deep neural networks should not be overlooked. It is centered primarily on frameworks backed by large Internet companies with large financial and computing resources: TensorFlow from Google and PyTorch from Facebook. The Keras framework has been included in TensorFlow, and work is underway to include Horovod in the TensorFlow Distribution Strategy API [15]. PyTorch has absorbed the capabilities of the classic Torch as well as the Caffe2 framework, which has powerful distributed training tools. Other frameworks, unfortunately, do not develop in the field of distributed training as fast as TensorFlow and PyTorch do. It is most likely that the trend of consolidating open projects for distributed training of neural networks and optimization of hyperparameters into TensorFlow and PyTorch will continue.
Hardware architectures specially designed for training neural networks, such as Google's Tensor Processing Unit (TPU) [8], Graphcore's Intelligence Processing Unit (IPU) [23], Nervana [21] from Intel and others, are also of interest for the development of distributed training. These architectures not only allow you to accelerate the training of neural networks, but also significantly affect the development of deep learning frameworks. In particular, the Mesh TensorFlow library was developed on the assumption that it will work on an n-dimensional grid of computing devices, which is typical for the TPU cluster architecture [66]. The performance and quality of training a neural network in Mesh TensorFlow were evaluated on a TPU cluster. Although Mesh TensorFlow may work in a cluster with a different architecture, its performance will be lower in that case. It is likely that over time the integration of DL frameworks and specialized hardware will become so deep and effective that it will become unprofitable to use CPU or GPU clusters to train neural networks.