Deep Learning on Computational-Resource-Limited Platforms: A Survey

Nowadays, the Internet of Things (IoT) gives rise to a huge amount of data. IoT nodes equipped with smart sensors can immediately extract meaningful knowledge from the data through machine learning technologies. Deep learning (DL) is steadily advancing smart sensing owing to its marked advantages over traditional machine learning. The promising prospect of wide-ranging applications creates demand for the ubiquitous deployment of DL in various contexts. As a result, performing DL on mobile or embedded platforms is becoming a common requirement. Nevertheless, a typical DL application can easily exhaust an embedded or mobile device owing to its large number of multiply-and-accumulate (MAC) operations and memory access operations. Consequently, it is a challenging task to bridge the gap between deep learning and resource-limited platforms. We summarize typical applications of resource-limited deep learning and point out that deep learning is an indispensable impetus of pervasive computing. Subsequently, we explore the underlying reasons for the high computational overhead of DL by reviewing fundamental concepts including the capacity, generalization, and backpropagation of a neural network. Guided by these concepts, we investigate the principles of representative research works, as well as three types of solutions: algorithmic design, computational optimization, and hardware revolution. In accordance with these solutions, we identify challenges to be addressed.


Introduction
The last decade has witnessed exciting development of deep learning (DL) technologies, which has contributed dramatic progress to signal and information processing applications including IoT and smart sensing. A deep neural network (DNN) comprises multiple neuron layers organized in a hierarchical structure. Parameters of every layer can be learned through iterative training. A well-trained DNN can distill useful features from raw data. All training samples are manually labeled. In one layer, input data can be mapped into a low-dimensional space through feature extraction. Subsequently, output features of the current layer are exported into the next layer. Outputs of the last layer imply the learned labels. A DNN can be fine-tuned by minimizing the error between manual labels and learned labels [1].
Deep learning enjoys significant advantages over traditional machine learning [2,3]. First, deep learning can achieve superior performance when data volume is massive.
This means that deep learning can fully benefit from the huge amount of data collected by IoT. Traditional machine learning techniques are preferable when data volume is small. However, their performance degrades markedly when data volume is extremely large. In contrast, deep learning scales well with massive data. Second, deep learning relies less on feature engineering. IoT can gather diversified categories of data that are distinct in nature. Manually extracting features of heterogeneous data is a daunting task. Traditional machine learning requires a domain expert to extract features. The manually identified features expose underlying patterns to algorithms. By contrast, deep learning autonomously extracts features in a layer-wise manner to represent input samples with a nested hierarchy of features. Every layer defines higher-level features based on lower-level features extracted by the previous layer. Third, deep learning techniques can outperform traditional ones in various smart-sensing-related tasks, such as computer vision, speech recognition, and human behavior understanding.
Deep learning is permeating diverse aspects of human society, which creates urgent demand for the ubiquitous deployment of DL-powered applications. In other words, deep learning is required to fit into resource-limited platforms like smartphones or wearable devices. Nevertheless, matching DL and resource-limited platforms is a challenging task. Inference with DL is extremely resource-consuming (processor, memory, energy, etc.) even though the more resource-consuming training phase can be offloaded onto high-performance-computing-powered mainframes. We investigate typical resource-limited DL inference solutions by categorizing the solutions and discussing open questions. The rest of this paper is organized as follows. Section 2 clarifies the impetus for developing resource-limited DL. Representative solutions are discussed in Section 3. Section 4 points out the challenges to be addressed. Section 5 concludes our work.

Computational-Resource-Limited Context of Deep Learning
2.1. Application Scenarios. Figure 1 shows typical applications of computational-resource-limited DL in the smart sensing context, including self-driving [25,26], artificial intelligence apps on smartphones [27], health/homecare robots [28][29][30][31], and intelligent wearable devices [32]. The DNN can be pretrained on a remote cloud while the mobile DL platforms communicate with the cloud and perform inference based on local computational and energy resources [33]. All these applications rely on embedded computers with limited onboard resources such as processor, memory, and battery. Two fundamental technologies of such applications are sensor data processing and computer vision.
Recognizing and responding to user behavior and the surrounding environment are the core functionalities of state-of-the-art Internet-of-Things (IoT) and mobile sensing applications. Nevertheless, raw sensor data are inevitably mixed with noise and uncertainty due to the complicated deployment environment. As a result, distilling precise and meaningful knowledge from raw sensor data is a challenging task. DL is one of the most competitive methods to overcome this challenge [34]. The prevalence of wearable (head-mounted) augmented reality (AR) devices, including the Microsoft HoloLens [35] and the Google Glass [36], has opened the way to a novel class of mobile computer vision applications. These applications vary from real-time traffic signal identification for navigation to human recognition for healthcare apps. All these application scenarios share the common demand to process continuous video streams in real time. The current leading-edge technology of video stream processing is DL, which handles video streams using a large-scale, pretrained convolutional neural network (CNN) or recurrent neural network (RNN) [37].

A Perspective of Pervasive Computing.
Deep learning can automatically extract features and achieve higher accuracy than traditional artificial intelligence techniques. As a result, deep learning is applicable to a broad range of scenarios. Additionally, open-source development tools like TensorFlow and Caffe are speeding up progress in deep learning. Research on fitting deep learning into resource-limited mobile or embedded platforms will undoubtedly be a large step towards pervasive deep learning.
Deep learning is currently an indispensable impetus that advances the progress of pervasive computing. As shown in Figure 2, we summarize the development of pervasive computing into three stages. The hardware and software solutions of a former stage are incorporated into the latter stages. In the 1990s, researchers in this area tried to facilitate the daily life of humans through Internet-interconnected desktops and mainframes. TCP/IP protocols formed the backbone of networks, and the software layer of pervasive applications typically focused on network organization and data delivery. In the following stage, the mobile Internet provides network access to users at any time and any place. IoT interconnects almost all digital sensors to collect raw data from diversified sources, which results in large data volume and places high demand on the computing power of the data processing platform. Thus, distributed or parallel middleware like Hadoop aggregates the computing power of huge numbers of commodity servers. Additionally, cloud computing provides the aggregated supercomputing power to customers through Web services. Data transmission between IoT and cloud platforms is further supported by Wi-Fi and 3G/4G. However, applications of this stage mainly adopt traditional machine learning solutions, which cannot keep improving their performance as the input data volume continuously increases. Nowadays, the learning and inference accuracy of DNNs can efficiently scale with the input data amount. However, high time and memory overheads impede the deployment of DL on resource-limited platforms. Matching deep learning and hardware platforms is an active research area. Software layer solutions mainly focus on simplifying the trained DNN to approximate a full-scale DNN. Hardware layer solutions involve embedded GPUs, artificial intelligence chips, or even analog computing based on new nonvolatile memory.
Additionally, 5G will meet even higher bandwidth requirements.

Computational Predicament of DNN: A Perspective of Underlying Principles.
Classification is a typical application scenario of DNNs. Under this scenario, the target is to establish a mapping from input samples to corresponding labels. The following concepts are the cornerstones for understanding the learning and inference of DNNs: hypothesis space, capacity, stochastic gradient descent, and generalization [38].
Hypothesis space is the set of all functions that a neural network can generate. One function is obtained by fitting part of the parameters of the neural network and can map homogeneous samples to the same label. Training a neural network means searching the hypothesis space for the optimal functions, which build the mapping relationships specified by the training data (in other words, minimize the training error). As a result, the size of the hypothesis space determines the potential ability of a neural network to find optimal functions.
Capacity of a neural network reflects the size of the hypothesis space, as well as the upper bound of the ability to fit functions. The optimal functions may lie beyond the hypothesis space if the capacity is not sufficiently large. In this case, the neural network can only search in the limited hypothesis space and find functions that approximate the optimal functions on a best-effort basis. Consequently, underfitting is inevitable.
A trained neural network is expected to correctly predict the label of previously unseen samples. Generalization reflects this kind of ability. Lower generalization error means higher generalization ability. Underfitting during the training phase can result in large generalization error in the inference phase.
Capacity sets the limit of the fitting ability, while generalization measures the ability to cope with unknown samples. Another vital issue with neural networks is the mechanism of searching the hypothesis space in the training phase. Conventionally, the search is driven by stochastic gradient descent; it always proceeds along the direction in which the training error drops fastest. The gradients are backpropagated from the deepest layer to the first layer to update weights in a layer-wise manner. Backpropagation converges when the difference in training error between two successive iterations is smaller than a threshold. However, stochastic gradient descent commonly cannot reach the global optimum. Although a near-optimal solution is generally sufficient to train a low-error neural network, this method typically requires a long time to converge. Moreover, parameters like the step length should be carefully selected to avoid fluctuation of the gradient.
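The search procedure described above can be illustrated with a minimal sketch. It is not a DNN: it assumes a single linear layer with squared error so that the gradient can be written by hand, and the function name `sgd_fit`, the step length, and the tolerance are hypothetical choices made only for this illustration. What it shows is the per-sample update along the negative gradient and the stopping rule based on the change in training error between two successive passes.

```python
import numpy as np

def sgd_fit(x, y, step=0.02, tol=1e-8, max_epochs=1000, seed=0):
    """Fit y ~ x @ w with per-sample gradient steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    prev_err = np.inf
    for epoch in range(max_epochs):
        # Visit the samples in random order: the "stochastic" part.
        for i in rng.permutation(len(x)):
            grad = 2.0 * (x[i] @ w - y[i]) * x[i]  # gradient of the squared error
            w -= step * grad                       # move against the gradient
        err = np.mean((x @ w - y) ** 2)            # training error after this pass
        # Stop when the error changes by less than the threshold.
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return w, err, epoch + 1

# Noise-free toy data generated from known weights (an assumption of this demo).
rng = np.random.default_rng(1)
x = rng.standard_normal((50, 2))
true_w = np.array([1.5, -2.0])
y = x @ true_w
w, err, epochs = sgd_fit(x, y)
```

A larger `step` makes each update more aggressive and can make the error oscillate instead of shrinking, which is the fluctuation the text warns about.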
From the perspective of underlying principles, the computational predicament of DNNs is due to the following reasons. The first is memory overhead. Oversizing the network is a conventional way to achieve low generalization error. A large capacity does not necessarily result in low generalization error. However, a large hypothesis space raises the upper bound of the generalization ability and thus increases the possibility of reaching a low error, especially when the target functions are not excessively complex. The second is time and energy overhead. Backpropagation is inherently iterative and time-consuming. The gradient is calculated in the course of minimizing the training error, which is a function of the weights and other parameters. The huge number of weights leads to a slow convergence speed. Moreover, these weights need to be frequently transmitted between processing units and memory. Consequently, the long computation time and intensive memory operations place high demand on processing ability and energy endurance. In addition, values of hyperparameters are conventionally selected through fine-tuning, which multiplies the time overhead. The third is the curse of dimensionality. High dimensionality of data aggravates the computational resource consumption. DNNs commonly need a large volume of training data to guarantee the generalization ability of the trained network. Higher dimensionality requires denser samples. If $A_1$ is the number of necessary training data points in the one-dimensional sample space, then the number of training data points is $A_1^n$ in the $n$-dimensional sample space [38]. For example, if 10 points suffice in one dimension, $10^3 = 1000$ points are needed in three dimensions. More training data points of higher dimension inevitably exacerbate the overheads of memory, time, and energy.

Challenges to Be Investigated.
Deep learning is currently more an art than a science. Neural networks are inherently approximate models and can often be simplified [39].
In spite of the dramatic learning power of deep learning, its computational cost has impeded its portability to resource-limited platforms [40]. DL algorithms face three kinds of barriers to optimizing computational performance. The first barrier is the resource-consuming iterative nature of DL training. Moreover, the empirical nature of DL aggravates this iterative cost. Up to now, the success of deep learning has mainly relied on empirical designs and experimental evaluations. Theoretical principles are still to be established. As a result, optimizing the performance of deep learning requires implementing and executing various possible models within the computational resource constraints to empirically identify the optimal one [41]. Extracting meaningful knowledge from a single input sample can require an enormous number of MAC operations, which can reach the order of billions [42]. Additionally, a single deep learning network can contain over a million parameters [43]. As a result, deep learning places high demands on processing ability, memory capacity, and energy efficiency. It is a vital issue to optimize deep learning networks by eliminating ineffectual MAC operations and parameters [42]. The second barrier is fitting DNNs into diversified modern hardware platforms. Different hardware platforms can be distinct in terms of clock frequency, memory access latency, intercore communication latency, and parallelism mode. Designers of DL models can be categorized into two different types: data scientists and computer engineers. Data scientists mainly concentrate on optimizing training and inference accuracy through data and neural network techniques. However, they have little or even no concern with computational cost. Efforts to improve accuracy do not necessarily result in smaller network size and higher speed. Computer engineers focus on accelerating deep learning based on hardware platforms.
They fine-tune or even redesign DNNs to match the models to the design requirements of resource-constrained applications. The third barrier is the lack of dedicated hardware. Traditional general-purpose digital computing hardware such as CPUs, GPUs, and FPGAs neglects some unique characteristics of deep learning. For example, deep learning only involves limited kinds of computational operations. Additionally, deep learning is significantly tolerant to noise and uncertainty. Dedicated hardware may trade off universality for performance [44][45][46][47][48].
Cloud-powered DL has been an active research area. Such solutions offload heavy computation onto remote cloud hosts: they assemble data from mobile or embedded devices, transfer the data to the cloud, and perform deep learning algorithms (both training and inference) on the cloud. Users face the risk of privacy leakage due to data transmission through computer networks, particularly if the data contain sensitive information. In addition, the reliability of cloud-based deep learning may be affected by network packet loss or even network failure. In this paper, we focus on three issues: first, the trade-off between neural network capacity and generalization error using algorithmic design; second, fitting DNNs into digital hardware through computational design; and third, next-generation hardware to cope with the computational predicament of DNNs. We categorize the existing solutions into three layers: the algorithmic, computational, and hardware layers. Figure 3 summarizes typical solutions. A practical method may integrate more than one solution.

Algorithmic Design.
Algorithmic designs focus on reducing resource consumption by mathematically adjusting or reforming the DNN model and algorithm. Typical simplification techniques include depthwise separable convolution, matrix factorization, weight matrix sparsification, weight matrix compression, data dimension reduction, and mathematical optimization.
Howard et al. designed a series of neural network models (MobileNets) to facilitate machine vision applications on mobile platforms [49]. MobileNets represent a kind of lightweight deep neural network based on depthwise separable convolutions. The main goal of MobileNets is to construct real-time and low-space-complexity models to satisfy the demands raised by mobile machine vision applications. The contributions of MobileNets are summarized as follows. First, core layers of MobileNets are derived from the depthwise separable convolution. The core concept of the depthwise separable convolution is to factorize a conventional convolution into a depthwise convolution layer and a pointwise convolution layer [50]. MobileNets adopt this core concept to reduce the model size, as well as the total number of multiplication and addition operations. Second, pointwise convolutions account for 95% of the total computation while the im2col reordering optimization is unnecessary for pointwise convolutions [51].
Thus, MobileNets avoid the massive computation of im2col reordering. Third, since MobileNets generate relatively small models and require comparatively few parameters, conventional anti-overfitting measures are adjusted. For instance, less regularization and data augmentation are used. Additionally, minimal weight decay (L2 regularization) is adopted on the depthwise filters. Fourth, two hyperparameters called the width multiplier and the resolution multiplier are applied to further shrink the model size.
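The saving from this factorization can be made concrete by counting MAC operations. The sketch below assumes stride 1 and "same" padding, so the output feature map has the same spatial size as the input; the function names and the layer dimensions are hypothetical and chosen only for illustration. A conventional convolution needs kernel × kernel × input channels × output channels × map height × map width MACs, whereas the separable form pays for the depthwise and pointwise parts separately.

```python
def standard_conv_macs(kernel, in_ch, out_ch, fmap):
    """MACs of a conventional convolution on an fmap x fmap feature map."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap

def separable_conv_macs(kernel, in_ch, out_ch, fmap):
    """MACs of a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * in_ch * fmap * fmap  # one filter per input channel
    pointwise = in_ch * out_ch * fmap * fmap           # 1x1 convolution mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
std = standard_conv_macs(3, 64, 128, 56)
sep = separable_conv_macs(3, 64, 128, 56)
ratio = sep / std  # equals 1/out_ch + 1/kernel**2
```

With a 3 × 3 kernel the ratio is dominated by the 1/9 term, so the separable layer in this example needs roughly one eighth to one ninth of the MACs of the conventional layer.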
The core concept of [49] is factorizing a conventional convolution to lower the computational complexity. This factorization does not affect the inference accuracy and thus is a lossless simplification method. However, lossy simplification is necessary if a stronger simplification effect is demanded. Samragh et al. customize a DL network to match an FPGA platform [39].
This method simplifies the weight matrix through clustering and encoding. Additionally, matrix-vector multiplication operations are factorized to decrease computational complexity. First, elements of the weight matrix are clustered by k-means into K clusters. Thus, every element is affiliated to a cluster, and the center of every cluster is the mean of its affiliated elements. Consequently, every element in the weight matrix is replaced with the corresponding center. In other words, every weight is approximated by the center of its affiliated cluster. Second, the approximate weights are encoded with a bit width of $\log_2 K$, and all cluster centers form a dictionary vector. As a result, encoding can significantly lower memory overhead.
Third, the matrix-vector multiplication can be factorized because the encoded matrix has abundant repeated elements. Therefore, the number of floating-point multiplication operations is dramatically reduced, which means lower computational complexity. Beyond these three basic steps, the method must address another problem: replacing weights with cluster centers inevitably introduces numerical error into the DL network. This error can affect the inference accuracy. The method of [39] adopts two solutions to handle this error. One is increasing the length of the dictionary vector (in other words, designating a larger K for k-means). The other is to iteratively cluster and retrain the weights. The method of [39] focuses on compressing the already trained weight matrix. By contrast, methods like lasso regularization can sparsify the weight matrix during training [52].
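The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of [39]: the plain one-dimensional k-means with quantile initialization, the function names `cluster_weights` and `factorized_matvec`, and the matrix sizes are all hypothetical. The factorized product first adds up the inputs that share a code and then multiplies once per cluster center, so each output needs K multiplications instead of one per column.

```python
import numpy as np

def cluster_weights(weights, k, iters=20):
    """Cluster scalar weights with a plain 1-D k-means (illustrative)."""
    flat = weights.ravel()
    # Assumed initialization: spread the centers over the weight quantiles.
    centers = np.quantile(flat, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        codes = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = flat[codes == j]
            if members.size:
                centers[j] = members.mean()  # center = mean of its members
    codes = np.argmin(np.abs(flat[:, None] - centers[None, :]), axis=1)
    return centers, codes.reshape(weights.shape)

def factorized_matvec(centers, codes, x):
    """Compute (centers[codes]) @ x with K multiplications per output row."""
    out = np.empty(codes.shape[0])
    for i, row in enumerate(codes):
        # Sum the inputs that share a code, then multiply once per center.
        sums = np.bincount(row, weights=x, minlength=centers.size)
        out[i] = sums @ centers
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
centers, codes = cluster_weights(W, k=4)   # centers is the dictionary vector
bits = int(np.ceil(np.log2(centers.size))) # bits needed per encoded weight
dense = centers[codes] @ x                 # reference: decode, then multiply
fast = factorized_matvec(centers, codes, x)
```

The two products agree because they compute the same sum in a different order; the approximation error relative to the original `W @ x` depends on K, which is why a longer dictionary reduces the accuracy loss.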
Lane et al. propose a software framework named DeepX to reshape the DNN reference model under limited resource constraints [53]. In contrast to the clustering method of [39], DeepX uses singular value decomposition (SVD) and reconstruction error minimization to compress the DNN model. On the first level, they adopt SVD to reconstruct and approximate the weight matrix of every DNN layer. Thus, DeepX dramatically reduces the number of DNN parameters in each layer. Additionally, the accuracy of this approximation is measured and tuned according to the reconstruction error. As a result, this reconstruction method avoids the predicament of retraining. On the second level, DeepX quantifies the computation load of every neuron and formalizes workload scheduling as a constrained dynamic programming problem. In this manner, computation load can be automatically scheduled onto processors to meet energy and time constraints.
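The first level can be illustrated with a generic truncated SVD. The sketch below is not DeepX's code; the function names, the matrix shape, and the chosen rank are hypothetical. It replaces an m × n weight matrix with two factors of sizes m × r and r × n, and reports the relative reconstruction error that a framework could use to tune r.

```python
import numpy as np

def svd_compress(weight, rank):
    """Return factors a (m x rank) and b (rank x n) with a @ b ~ weight."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold the singular values into the left factor
    b = vt[:rank, :]
    return a, b

def relative_error(weight, a, b):
    """Frobenius-norm reconstruction error relative to the original matrix."""
    return np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)

# Toy layer whose weight matrix has exact rank 4 (an assumption of this demo).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))
a, b = svd_compress(W, rank=4)
err = relative_error(W, a, b)
params_before = W.size          # 64 * 32 = 2048
params_after = a.size + b.size  # 64 * 4 + 4 * 32 = 384
```

Real weight matrices are rarely exactly low rank, so in practice the error grows as the rank shrinks and the rank is chosen to keep that error within a tolerance.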
Pruning or compressing an already-trained DNN could result in large approximation error [54][55][56][57]. One alternative is to train a sparse DNN. Lin et al. propose a method named structured sparsity regularization (SSR) to achieve weight matrix sparsification during training [58]. They introduce two distinct structured-sparsity regularizers into the objective function of weight matrix sparsification. These two regularizers can constrain the intermediate status of the DNN filter matrix to be sparse. Subsequently, they adopt an Alternative Updating with Lagrange Multipliers (AULM) scheme to alternately optimize the sparsification objective function and minimize recognition loss. The SSR method enjoys significantly lower time and memory overhead than state-of-the-art weight matrix pruning methods. Nazemi et al. propose a DNN training method to remove redundant memory access operations.
This method utilizes Boolean logic minimization [59]. In the training process, the sign function is adopted as the activation. Consequently, activations are confined to binary values. Every layer of the DNN (except the first layer and the last layer) is modeled as a multi-input multi-output Boolean function. In the inference process, outputs of the DNN are obtained by synthesizing a Boolean expression rather than computing the dot product of the input and weight. In other words, an enormous number of memory access operations are avoided, which removes a great deal of memory access latency and energy consumption. The aforementioned algorithmic solutions focus on simplifying the DNN model so as to reduce MAC operations and memory consumption. Nevertheless, physical durability, especially energy efficiency, is still a daunting barrier to benefiting various practical applications through deep learning. DeLight is a low-overhead framework that enables efficient training and execution of deep neural networks under low-energy constraints [60]. The authors of [60] restrain the DL network size through energy characterization according to the pertinent physical resources. They design an automatic customization methodology to adaptively fit the DNN into the specific hardware while inducing minimum degradation of learning accuracy. The core concept of DeLight is to project data to low-dimensional embeddings (subspaces) in a context-and-resource-aware manner. Consequently, insights into data samples can be achieved with dramatically fewer neurons. Moreover, trained models in every embedding are integrated to enhance learning accuracy. In short, DeLight achieves fine-grained energy consumption control based on data dimension reduction. The framework HyperPower proposes to bound energy and memory consumption from the perspective of hyperparameter optimization [41].
This is a hyperparameter optimization framework based on Gaussian processes (GP) and Bayesian optimization [61,62]. This framework denotes test error as a function f(x), where x is a data point in the design space of hyperparameters. Additionally, power and memory overhead is denoted as a function g(x). Subsequently, hyperparameter tuning is formalized as an optimization problem: minimizing f(x) under the constraint that g(x) is lower than a threshold. Minimizing f(x) is costly because f(x) has no closed form. Consequently, HyperPower adopts a GP to approximate the distribution of f(x). Moreover, the framework leverages Bayesian optimization to iteratively select optimal hyperparameters and update the distribution of f(x). f(x) is assumed to follow a Gaussian distribution. Let y denote the observations of f(x). At the very beginning, an initial approximation of f(x) can be obtained as $p_M(y \mid x)$ based on this assumption and a set of known (x, y) values (Gaussian process regression). Every iteration includes the following operations. First, an optimal value of x is selected from the design space to refine $p_M(y \mid x)$. The selected x should push the value of f(x) in a decreasing direction. This value of x is identified by maximizing an expected-improvement-based acquisition function. In addition, the acquisition function incorporates the constraint using an indicator function, which equals one if the constraint is satisfied and zero if not. Second, the neural network is configured in accordance with the new design parameter (the newly identified x) and trained to obtain the test error (a new value of y). Third, the mean and covariance are updated using the new (x, y), and thus, $p_M(y \mid x)$ is updated to $p_M(y)$.
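The selection step of each iteration can be sketched as follows. This is a minimal illustration rather than HyperPower's implementation: it uses the standard closed-form expected improvement for minimization, assumes that the Gaussian process posterior mean and standard deviation at each candidate are already available, and treats g(x) as a known number. The candidate names, the posterior values, and the budget are invented for the demo.

```python
import math

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best observed error (minimization)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (best - mu) * cdf + sigma * pdf

def constrained_acquisition(mu, sigma, best, cost, budget):
    """Expected improvement multiplied by the indicator of g(x) <= budget."""
    feasible = 1.0 if cost <= budget else 0.0
    return feasible * expected_improvement(mu, sigma, best)

# Hypothetical candidates: (name, posterior mean, posterior std, cost g(x)).
candidates = [
    ("a", 0.10, 0.02, 5.0),
    ("b", 0.05, 0.02, 9.0),  # most promising error, but over the budget
    ("c", 0.12, 0.05, 3.0),
]
best_error, budget = 0.11, 6.0
scores = {name: constrained_acquisition(mu, sd, best_error, cost, budget)
          for name, mu, sd, cost in candidates}
chosen = max(scores, key=scores.get)  # next hyperparameter setting to train
```

The indicator removes infeasible settings from consideration regardless of how low their predicted error is, which is how the constraint on g(x) enters the search.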

Computational Optimization.
Computational optimization relies on reengineering the algorithm implementation in accordance with a specific hardware architecture. Some conventional optimization techniques are code parallelizing, fine-tuning of parallel code, data caching, and fine-grained memory utilization.
Huynh et al. developed a tool named DeepMon for continuous vision applications based on commodity mobile GPUs [37]. Large deep neural networks (DNNs) powered by commodity mobile GPUs commonly cannot achieve strict real-time performance due to limited computational resources. However, the frame rate can be low (one to two frames per second) in some use cases, such as speaker recognition and elder nursing care. These application scenarios place comparatively low demands on real-time performance. DeepMon implements large DNNs for such applications based on commodity mobile GPUs and achieves near real-time performance. In the aforementioned applications, first-person-view images are unlikely to change significantly during a short time span. DeepMon divides each image frame into equal-size blocks. DeepMon caches the intermediate results of each block when calculating the convolution of one frame. Subsequently, similar blocks are identified between this frame and the next frame. Consequently, the cached results can be directly utilized to calculate the convolution of the next frame. Additionally, cached results expire after a certain time period. Similarity between two images is identified based on the color distribution histogram and the chi-square distance metric. In addition to this caching mechanism, DeepMon leverages Tucker-2 decomposition of convolution layers [63] to factorize a traditional convolution layer into several small convolution layers. As a result, the computation cost of convolution is reduced. Finally, DeepMon tunes GPU code on various mainstream commodity mobile GPUs. Tuned and optimized GPU code is encapsulated into separate kernels for each GPU model. As a result, DeepMon can adaptively adopt appropriate kernels at runtime so as to fit a specific GPU as well as possible. The main idea of DeepMon is caching intermediate results to eliminate redundant computation. Another typical technique is GPGPU acceleration. Cao et al.
proposed a GPGPU-powered RNN model that executes locally on mobile devices [64]. Recurrent neural networks (RNNs) find wide application in areas such as speech recognition and chatbots. Traditional mobile applications of RNNs generally offload the main computation onto the cloud. However, the cloud-based implementation induces security and efficiency issues. Cao et al. pointed out that existing GPGPU-accelerated methods for convolutional neural networks (CNNs) cannot be directly transplanted to mobile-device-based RNNs. On the one hand, an RNN inherently contains many sequential operations, which constrains its parallelism. On the other hand, existing GPGPU-powered RNN methods are specially designed for desktop GPGPUs. Such methods cannot directly fit into mobile GPGPUs because a mobile GPGPU possesses significantly less memory capacity and fewer processing cores. In an RNN, the inevitable dependencies between adjacent cells dramatically increase the difficulty of exploiting parallelism among cells. Nevertheless, operations within a cell still exhibit considerable parallelism. In the work of [64], the computation of a cell is factorized at fine granularity so that it fits into the mobile GPGPU. The adaptive platform DL framework Deep³ also adopts the idea of GPGPU-powered computing. However, Deep³ exploits parallelism at three levels: data, network, and hardware. The ultimate goal of Deep³ is to bridge the gap between the data science perspective on designing deep learning and the computer engineering perspective on optimizing deep learning. First is hardware parallelism. Deep³ extracts basic operations (layers) of a deep learning network, including convolution, maximum pooling, mean pooling, matrix multiplication, and nonlinearities. The optimized implementation of a basic operation can differ dramatically from one hardware platform to another.
For example, by altering the dimensionality of matrices, we can observe whether matrix multiplication is computation-intensive or data-intensive on a specific platform. Deep³ employs subroutines to perform hardware profiling. Each subroutine runs a specific operation with varying sizes on different platforms, separately. In this manner, Deep³ identifies the optimal size of a specific operation for a target platform. These optimal sizes are vital guidelines for splitting an entire deep learning network into subnetworks that suit the computational, memory, and bandwidth resources of the target platform. Second is network parallelism. Deep³ breaks down the entire deep learning network into overlapping subnetworks using a depth-first method. Each subnetwork has the same depth as the original network with significantly fewer edges. Every subnetwork can be independently updated, and such local updates are periodically collected by a parameter coordinator to optimize the entire network. Third is data parallelism. Deep³ decomposes the high-dimensional input data into several low-dimensional subspaces through dictionary learning. Dictionary learning can be efficiently performed by machine learning algorithms like spectral clustering [65][66][67]. Subsequently, each subnetwork is dedicated to handling a specific subspace and different subspaces are processed in parallel.
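Returning to the block-level caching attributed to DeepMon earlier in this section, the idea can be sketched as follows. This is an assumption-laden illustration, not DeepMon's code: the class `BlockCache`, the similarity threshold, the expiry age, the number of histogram bins, and the stand-in for the convolution are all hypothetical, and the chi-square distance uses one common definition with a factor of one half.

```python
import numpy as np

def color_histogram(block, bins=8):
    """Normalized intensity histogram of an image block with values 0..255."""
    hist, _ = np.histogram(block, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def chi_square_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

class BlockCache:
    """Reuse a block's result while the block looks similar and is fresh."""

    def __init__(self, threshold=0.05, max_age=3):
        self.threshold = threshold  # assumed similarity cut-off
        self.max_age = max_age      # assumed expiry, in frames
        self.entries = {}           # block position -> (histogram, result, frame)

    def lookup_or_compute(self, key, block, frame_idx, compute):
        hist = color_histogram(block)
        entry = self.entries.get(key)
        if entry is not None:
            old_hist, result, stamp = entry
            fresh = frame_idx - stamp <= self.max_age
            if fresh and chi_square_distance(hist, old_hist) <= self.threshold:
                return result, True  # cache hit: skip the computation
        result = compute(block)
        self.entries[key] = (hist, result, frame_idx)
        return result, False

def stand_in_conv(block):
    """Placeholder for the expensive per-block convolution."""
    return float(block.mean())

cache = BlockCache()
dark = np.zeros((4, 4), dtype=np.uint8)
bright = np.full((4, 4), 255, dtype=np.uint8)
_, hit0 = cache.lookup_or_compute((0, 0), dark, 0, stand_in_conv)     # first sight
_, hit1 = cache.lookup_or_compute((0, 0), dark, 1, stand_in_conv)     # unchanged
_, hit2 = cache.lookup_or_compute((0, 0), bright, 2, stand_in_conv)   # changed
_, hit3 = cache.lookup_or_compute((0, 0), bright, 10, stand_in_conv)  # expired
```

Only the second lookup reuses a cached result; the third sees a different histogram and the fourth finds an entry older than the expiry age.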
Wu et al. explore mobile deep learning from the joint perspective of software and hardware architecture. They propose a platform named DeepShark to equip commercial-off-the-shelf (COTS) mobile devices with adaptive resource scheduling [68]. Methods like DeepX try to compress the deep model. By contrast, DeepShark seeks a trade-off between response speed and memory consumption. It splits a pretrained DNN into code blocks and incrementally runs the blocks on the system-on-chip (SoC) to accomplish inference. Consequently, DeepShark only needs to load the currently required data from external storage into memory rather than hold all data in memory throughout the execution period.
Thus, DeepShark remarkably lowers memory consumption. In addition, DeepShark induces no accuracy loss because it involves no model compression or approximation. Moreover, privacy risks are avoided because all user-relevant data are processed locally. Finally, DeepShark is transparent to deep learning developers. It overloads default system functions of TensorFlow and Caffe, so developers can invoke DeepShark APIs in the same way as calling TensorFlow or Caffe APIs. By contrast, the work of [59] eliminates redundant memory operations in an algorithmic manner.
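The idea of running a pretrained network block by block, loading only the parameters that are currently needed, can be sketched as follows. This is an illustrative assumption of how such a scheme might look, not DeepShark's actual API: the file layout (one JSON file per block) and the names `save_blocks`, `dense_relu`, and `run_incrementally` are hypothetical.

```python
import json
import os
import tempfile


def save_blocks(layers, directory):
    """Write each layer (weights and bias) to its own file; return the paths."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"block_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths


def dense_relu(x, weights, bias):
    """One fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def run_incrementally(x, block_paths):
    """Run inference while holding only one block's parameters in memory."""
    for path in block_paths:
        with open(path) as f:
            block = json.load(f)  # load only the block needed right now
        x = dense_relu(x, block["w"], block["b"])
        del block  # parameters can be released before the next load
    return x


if __name__ == "__main__":
    layers = [
        {"w": [[1.0, -1.0], [0.5, 0.5]], "b": [0.0, 1.0]},
        {"w": [[1.0, 1.0]], "b": [-0.5]},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_blocks(layers, tmp)
        print(run_incrementally([2.0, 1.0], paths))
```

Peak memory is bounded by the largest single block plus the activations, at the price of extra storage reads, which is the response-speed versus memory trade-off discussed above.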

Hardware Revolution.
Haensch et al. point out that the aspiration to apply DL to all fields of daily life is inherited from pervasive computing. However, academia and industry face challenging barriers in scaling DL to fit pervasive applications [69]. Overhead is a vital problem for the pervasive application of DL, where overhead refers to the time and computational resources required to construct, train, and run the model. Prior research shows that GPUs take a step towards pervasive DL, whereas customized hardware dedicated to DL has been confirmed to outperform general-purpose GPUs.
Han et al. design a dedicated processor for DNN-based real-time object tracking [70]. This processor achieves low power consumption through a DNN-specific processor architecture and a specialized algorithm. However, this dedicated processor still relies on digital computing.
A DL network requires only a limited set of mathematical operations (for example, matrix multiplication), and such operations recur frequently in model training and inference. These two characteristics enable efficient execution of DL algorithms not only on GPUs but also on analog computing circuits. Additionally, DL algorithms are highly tolerant of noise and uncertainty, which makes it possible to sacrifice numerical precision with little loss in algorithmic accuracy. The analog computing discussed by Haensch et al. [69] is an extension of in-memory computing. Existing nonvolatile memory materials cannot efficiently accommodate analog in-memory computing, and reengineering memory materials is a challenging task. A new generation of DL-accelerating hardware has come into the view of academia and industry. This kind of hardware trades versatility for low overhead. Nevertheless, the complexity of constructing and training DL models is beyond the capacity of any single kind of hardware. As a result, researchers need to approach the solution from a systematic perspective and integrate several kinds of accelerators into one system. The vitality of new accelerators heavily depends on this issue. Moreover, Haensch et al. state that analog accelerators will not completely replace digital ones. Both digital and analog accelerators should be developed as far as possible, and analog accelerators should be capable of seamless integration with digital ones.
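The noise tolerance noted above can be illustrated with a toy experiment. The sketch below adds zero-mean Gaussian noise to every output of a matrix-vector product, which is only a crude stand-in for analog non-idealities; the noise model, the names `matvec`, `noisy_matvec`, and `argmax`, and all numbers are assumptions for illustration.

```python
import random


def matvec(weights, x):
    """Exact matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def noisy_matvec(weights, x, sigma, rng):
    """Same product with Gaussian noise of standard deviation sigma on each output."""
    return [y + rng.gauss(0.0, sigma) for y in matvec(weights, x)]


def argmax(values):
    """Index of the largest entry, i.e. the predicted class."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    x = [0.2, 0.9]
    rng = random.Random(0)
    exact = argmax(matvec(weights, x))
    agree = sum(argmax(noisy_matvec(weights, x, 0.01, rng)) == exact
                for _ in range(100))
    print(f"{agree}/100 noisy runs give the same class as the exact product")
```

The predicted class only changes when the noise exceeds the margin between the two largest scores, which is why moderate analog imprecision often leaves classification results intact.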
Analog computing can be implemented based on electrochemical reactions. Such a mechanism has been investigated to establish hardware foundations for DL-related problems. For example, neuromorphic computing can circumvent inherent performance bottlenecks of traditional computing via parallel processing and crossbar-memory-enabled data access. Fuller et al. link a redox transistor to a conductive-bridge memory (CBM) and thus establish an ionic floating-gate (IFG) memory array [71]. The working life of the redox transistors can exceed one billion "read-write" operations. Additionally, data access frequencies can exceed one megahertz.
This IFG-based neuromorphic system shows that in-memory learning and inference can be performed efficiently on low-voltage electrochemical systems. The adaptive electrical characteristics of IFG memory could pave the way for neuromorphic computers that significantly outperform conventional digital computers in power efficiency. Such neuromorphic analog computers could adapt deep learning to power-limited contexts, or even enable persistent lifelong learning in a product. Another electrochemistry-based hardware prototype is proposed in [72]. Tsushiya et al. design a solid-state ionic device to address decision-making issues such as multiarmed bandit problems (MBPs). This device opens a way to achieve decision-making through the motion of ions, which could contribute to mobile artificial chips and find various applications including deep learning.
In addition to analog computing, photonic (or optical) computing is also a promising hardware solution. Currently, mainstream photonic computers replace components of electronic digital computers with photonic equivalents, which can achieve higher speed and bandwidth. Some pioneering research works have adopted photonic computing to support DL-related computations. Rios et al. achieve all-photonic in-memory computation by combining integrated optics with collocated data storage and processing [73]. They fabricate nonvolatile memory using the phase-change material Ge2Sb2Te5 and perform direct scalar and matrix-vector multiplications based on this nonvolatile photonic memory. The computation results are represented by the output pulses. This photonic computing system offers a promising shift towards high-speed, large-bandwidth on-chip photonic computing that circumvents electro-optical conversions. Such a system could be the cornerstone of purely photonic computers. Feldmann et al. point out that conventional computing architectures differ from real neural tissue in that they physically separate the functionalities of data memory and processing [74].
This separated design poses a daunting barrier to achieving high-speed and power-efficient computing systems like the human brain. A promising way to overcome this barrier is to develop novel hardware that emulates the neurons and synapses of the human brain. Feldmann et al. therefore investigate wavelength division multiplexing techniques to implement a photonic neural network based on a scalable circuit, which can mimic the neurosynaptic system in an all-optical manner. This circuit retains the intrinsic high-speed and large-bandwidth characteristics of an optical system and enables efficient execution of machine learning algorithms.
Quantum computing is another prospective solution to support DL. Gao et al. adopt a quantum generative model to design a quantum machine learning algorithm. This model is better at representing probability distributions than conventional generative models. In addition, the model can achieve an exponential speedup at least in some application scenarios, under the assumption that a quantum computer cannot be fully simulated by the conventional digital computing paradigm. The work of [75] opens a way to quantum machine learning and demonstrates a striking instance in which a quantum algorithm of both theoretical and practical value can reach exponentially higher performance than conventional algorithms.
Novel hardware paradigms such as ionic memory, photonic computing, and quantum computing could set indispensable stages for resource-limited deep learning. Although these hardware advances may initially be motivated by facilitating deep learning applications, next-generation hardware could find much broader applications in the future.
3.6. Discussion. Table 1 summarizes representative works from the perspective of the underlying principles that account for the computational predicament of DNNs. Existing research works commonly aim to deal with one or more of the causes of this predicament. The first is the memory overhead induced by an oversized network. Earlier algorithmic solutions tend to compress or prune the weight matrix of a pretrained DNN. Compressing or pruning is a trade-off between capacity (or generalization ability) and memory efficiency. However, directly modifying a pretrained network inevitably results in unacceptable error. Although retraining is an option, it induces considerable extra time overhead.
As a result, recent algorithmic solutions propose to obtain a sparse network through training. The core idea is to carefully select a regularization term for the error function, which forces the network to form sparse weight matrices with little or even no loss in generalization ability. In addition to algorithmic solutions, digital computers can also accommodate large pretrained networks in the inference phase through fine-grained utilization of memory. The second is the time or energy overhead induced by backpropagation, memory operations, and hyperparameter tuning. From the algorithmic point of view, a large amount of redundant computation can be eliminated, especially in matrix-matrix or matrix-vector multiplications. In this manner, both time overhead and energy consumption are reduced. Time efficiency can also be improved by reusing intermediate results of convolution, parallelization on digital processors, and code fine-tuning on digital processors. Unlike the overhead caused by arithmetic processing, the time consumption induced by memory operations is difficult to handle. The reason is that traditional digital computers adopt the von Neumann architecture and thus have separate processing and memory units. Owing to the statistical and approximate nature of DNNs, Boolean logic minimization can contribute to the reduction of memory operations, as well as energy consumption.
This solution achieves efficient performance in handwritten digit recognition. However, it restricts the activation functions to sign functions, which limits the generalization ability. Regarding energy-related hyperparameter tuning, mathematical methods such as Gaussian processes can identify a more efficient search path in the parameter space, rather than relying merely on human experience or even random search.
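The regularization-based sparse training mentioned earlier in this discussion can be sketched on a toy linear model. The example uses proximal gradient descent with soft-thresholding, which is one common way to optimize an L1 (lasso) penalty and is swapped in here for illustration; the surveyed works may use other optimizers, and the data, hyperparameters, and function names are assumptions.

```python
def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink towards zero, small values become exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def train_lasso(xs, ys, lam=0.1, lr=0.5, epochs=200):
    """Fit y ~ w . x by minimizing (1/2n) * sum of squared errors + lam * ||w||_1."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Gradient of the squared-error term only.
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(d):
                grad[j] += err * x[j] / n
        # Gradient step on the data term, then shrinkage from the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


if __name__ == "__main__":
    # Only the first feature carries signal; the other two are irrelevant.
    xs = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    ys = [2 * x[0] for x in xs]
    print(train_lasso(xs, ys))
```

The penalty drives the weights of the two irrelevant features to exactly zero while slightly shrinking the useful one, which is the sparsity-for-little-accuracy trade that the regularization term is meant to achieve.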
Energy consumption is mainly caused by arithmetic processing and memory operations. Consequently, these two are the key problems. Regarding time overhead, most existing solutions focus on peripheral issues such as redundant computations. However, the problem is rooted in stochastic gradient descent. The training time would drop dramatically if we could construct an improved gradient that leads to convergence more rapidly. Memory operation overhead is an inherent problem of the von Neumann architecture. Resolving this problem requires new computing paradigms such as in-memory computing. The third is the curse of dimensionality. Conventional solutions such as weight matrix decomposition and data embedding can reduce the feature dimension. As far as we know, there are few research works on feature dimension reduction in the computational-resource-limited context. Relevant topics remain to be investigated.
It should be noted that the aspects discussed above are not isolated from each other. A systematic view may suggest a more efficient solution. For example, a pretrained sparser network undoubtedly demands less inference time than a denser network. As another example, reading and writing weights induces less time and energy consumption if the weight matrix is sparser. Table 1 does not cover innovative computing paradigms such as analog computing and quantum computing. We will discuss such computing paradigms in more detail later. Table 2 provides more details on the representative research works. All three categories of solutions are under rapid development. The overall motivation is to apply DL to mobile/embedded contexts efficiently. Algorithmic solutions are at the core because they directly cope with the business logic of real applications and aim to reduce time and memory complexity at the mathematical logic layer. Existing solutions mainly focus on simplifying matrix-and-vector operations, data/network embedding, hyperparameter tuning, and sparsification through regularization. Further research is still needed to explore reducing computational overhead through activation functions.
In addition to the mathematical logic layer, traditional general-purpose digital hardware bridges the gap between mathematical algorithms and real applications. To the best of our knowledge, most practical mobile/embedded DL-based applications are based on traditional hardware. In this case, classical computational optimization methods can be adopted to fully utilize computational resources, including data caching, parallelization, and code fine-tuning. However, many existing DNNs are designed by AI experts, who place little or even no concern on the adaptiveness of DNNs to hardware. As a result, the DNNs may need some reshaping to efficiently fit into a specific hardware device. In view of this, we expect that researchers can design DNNs in a joint view of both AI experts and computer engineers.
Currently, representative computational performance metrics include memory overhead, memory access latency, parallelism (full utilization of processors), and power consumption. However, some topics still remain to be investigated. For instance, DeepShark uses external storage as the cache to support fine-grained memory utilization. The power consumption caused by data I/O remains to be discussed. In addition, the balance between cache size and cache hit rate is also an interesting topic. Table 3 shows the datasets that were used to evaluate a DNN with respect to more than one performance metric. These datasets and the relevant algorithms are good candidates to serve as benchmarks.
Nevertheless, traditional general-purpose digital hardware may still be inefficient under certain scenarios. Consequently, DL-dedicated digital hardware is becoming increasingly popular, although the computational performance of digital hardware is facing a bottleneck due to physical constraints. Next-generation computing technologies such as quantum computing are promising solutions to overcome such constraints, and they will undoubtedly boost the progress of deep learning even though they are still in their infancy.

Challenges to Be Addressed
Despite the promising prospect of existing solutions, we are still facing some considerable challenges to be addressed.

Fundamental Support for Hardware Revolution.
Analog computing is a promising technology to facilitate DL because DL is tolerant of numerical errors. Compared with the other innovative types of hardware, analog computing currently holds the leading position. Analog array technology has been successfully applied to DNNs to process common datasets [108], while other innovative hardware technologies such as photonic computing and quantum computing are yet to be applied to DNNs [73,75]. The advantage of the analog array is that it uses analog circuits to compute matrix-vector multiplication with constant time overhead, irrespective of the size of the matrix. However, it is difficult to map a convolutional neural network directly onto conventional analog arrays because kernel matrices are commonly small and the constant-time multiplication has to be repeated many times sequentially. Rasch et al. parallelize the training by duplicating the kernel matrix of a convolution layer on different analog arrays and stochastically dispatching parts of the computation onto the arrays. As a result, the speedup ratio is proportional to the number of kernel matrices per layer [106].

Table 1: Representative research works and techniques, grouped by cause of the computational predicament.

| Cause | Representative research works | Techniques |
| --- | --- | --- |
| Memory overhead induced by oversized network | [39] | Weight matrix compression of a pretrained network through clustering: merging similar functions in the hypothesis space |
| | [56] | Weight pruning of a pretrained network: removing the weights that contribute little to fitting functions in the hypothesis space |
| | [39,58] | Sparse training: lasso regularization, structured sparsity regularization |
| | [68] | Computational optimization on digital computers: fine-grained utilization of memory |
| Time or energy overhead induced by backpropagation, memory operations, and hyperparameter tuning | [37,39,49] | Algorithmic design to avoid computation redundancy: depth separable convolution, avoidance of im2col reordering, factorized matrix-vector multiplication based on SVD and Tucker-2 |
| | [37] | Caching of digital computers: reuse intermediate results of convolution to avoid redundant computation |
| | [39,40] | Parallelization on digital processors: FPGA, GPGPU |
| | [37,40,53] | Full utilization of digital processors: profiling and fine-tuning of CPU or GPGPU codes |
| | [59] | Avoidance of frequent memory operations through Boolean logic minimization |
| | [41] | Hyperparameter tuning using Gaussian process |
| Curse of dimensionality | [53] | SVD decomposition of the weight matrix |
| | [60] | Data embedding |

In addition to the high speedup ratio, another advantage of analog computing is the collocation of processing and memory. Under a traditional von Neumann architecture, processing units and memory are separate. Data transmission between processing units and memory can consume orders of magnitude more energy than arithmetic operations. In addition, a typical deep learning application routinely demands an enormous number of data transmission operations, which consume dramatically more energy than the computation itself. One promising solution is to collocate processing units and memory using phase-change memory [109].
Although analog computing hardware has exhibited promising potential to outperform traditional von Neumann architecture hardware such as GPUs, most existing research works focus on the functionality of such analog hardware. Efficiency and reliability issues such as stability and durability are yet to be investigated before the hardware can move out of the lab into real applications [110].

More Efficient Algorithmic Solutions.
Some algorithmic solutions, such as weight matrix compression and weight matrix decomposition, approximate the original pretrained neural network with a simplified one. Nevertheless, the empirical nature of DL hinders deriving an exact theoretical upper bound on the approximation error. The absence of this upper bound makes it difficult to prove the robustness of such approximations. Additionally, owing to the lack of theoretical principles, many algorithmic techniques require iteratively tuning and running the model to select the optimal configuration. However, the design space of model parameters is large. As a result, applying such algorithmic techniques to large-scale real applications may be a daunting task, especially when we need to deal with hyperparameters within large ranges.
Posttraining simplification of the DNN may result in large error. Moreover, a large number of parameters hinders stochastic gradient descent from reaching a near-optimal solution. Sparse training is a promising method to cope with these two problems.
Achieving high capacity of a deep neural network is a conventional way to guarantee low generalization error. However, most deep neural networks obtain high capacity by harnessing a large number of weights, which means dense connections between consecutive layers. This explains why many existing deep neural networks adopt fully connected layers. Nevertheless, real biological scale-free neural networks can significantly outperform state-of-the-art deep learning networks even though their connections are sparse. Inspired by this observation, Mocanu et al. construct a sparse scale-free network topology with two consecutive layers [111]. This topology substitutes sparse layers for fully connected layers before training. Their sparse evolutionary training method quadratically decreases the number of parameters without loss in accuracy. This sparse training method opens a way to lower the barrier to fitting deep learning into traditional hardware.
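The quadratic reduction in parameters can be made concrete with a short calculation. The sketch below assumes an Erdős–Rényi style initialization in which a connection between a layer of n_in neurons and a layer of n_out neurons exists with probability epsilon * (n_in + n_out) / (n_in * n_out); this formula, the value of epsilon, and the function names are assumptions for illustration rather than details taken from the text above.

```python
import random


def dense_params(n_in, n_out):
    """Number of weights in a fully connected layer."""
    return n_in * n_out


def connection_probability(n_in, n_out, epsilon):
    """Assumed probability that a given connection exists in the sparse layer."""
    return min(1.0, epsilon * (n_in + n_out) / (n_in * n_out))


def expected_sparse_params(n_in, n_out, epsilon):
    """Expected number of weights: grows with n_in + n_out, not n_in * n_out."""
    return connection_probability(n_in, n_out, epsilon) * n_in * n_out


def random_sparse_mask(n_in, n_out, epsilon, seed=0):
    """Sample the set of (input, output) index pairs that are connected."""
    rng = random.Random(seed)
    p = connection_probability(n_in, n_out, epsilon)
    return {(i, j) for i in range(n_in) for j in range(n_out) if rng.random() < p}


if __name__ == "__main__":
    n_in, n_out, eps = 1000, 1000, 20
    print("dense:", dense_params(n_in, n_out))
    print("sparse (expected):", round(expected_sparse_params(n_in, n_out, eps)))
```

Under this assumption the dense layer has a number of weights proportional to the product of the layer sizes, while the sparse layer has a number proportional to their sum.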
Based on the method of [111], Liu et al. train a sparse MLP (multilayer perceptron) model with a million neurons to classify microarray gene data [110]. This MLP model can be trained in a time on the order of 10^1 seconds and achieves lower generalization error than traditional models (dataset: Leukemia, dimension: 26, 1397 training data samples and 699 testing data samples). The method of [111] mainly focuses on building a novel network topology, yet still adopts conventional stochastic gradient descent to train the model. Dettmers et al. harness exponentially smoothed gradients to identify the layers and weights that efficiently decrease the training error of a sparse model. As a result, the model can converge significantly faster. In addition, the trained network is insensitive to hyperparameters.
[Residue of a table on datasets and performance metrics: [41] Hyperspectral Remote Sensing Scenes; processor utilization rate [40]; power consumption [60]]
These research works typically concentrate on sparse training of a few types of DNN. In view of the diversity and complexity of DNNs, it is highly valuable yet challenging to exploit sparse training for various types of DNNs under specific application requirements.

Systematic Integration.
As discussed in Section 2, the ultimate goal of resource-limited DL is ubiquitous deployment of DL. Diversified applications can put forward various requirements on ubiquitous DL. As a result, we need to systematically integrate various types of solutions.
Next-generation computing hardware should collaborate seamlessly with traditional digital hardware, with the ultimate target of accommodating rapidly evolving DNNs.
Gil and Green argue that future computing hardware is based on the intersections of three pairs of fields: mathematics and information, neuron-inspired biology and information, and physics and information. These intersections give rise to the concepts of digital computing, neural computing, and quantum computing, respectively. Gil and Green denote the three concepts as bit, neuron, and qubit, respectively. As shown in Figure 4, the next-generation AI-enabled computing system requires integration of the three [117]. In this figure, we adopt quantum computing (qubits) to represent future computing paradigms. Novel computing paradigms such as analog computing should also be taken into consideration. We discuss this integration in detail as follows.

Digital Computing.
The advantage of digital computing lies in its stable binary nature. With the same binary input, a digital computing system should always generate the same output. This nature is the cornerstone of building robust and stable systems for data storage and processing. Classical digital computing is still an efficient solution not only for mathematical and logical operations but also for persistent data storage. In future computing systems, digital computing will still occupy an indispensable position owing to its robust and reliable nature.

Neuron Computing.
Despite the advantages of digital computing, current DNN-based AI methods require reshaping or even reinventing this computing paradigm. Although AI has achieved dramatic progress in the last decade, it is still in the phase of narrow AI, which demands a large amount of manually labeled data to acquire knowledge of specialized tasks. In the next phase, we expect broad AI that can autonomously adapt to diversified tasks in various domains. Narrow AI is already computationally expensive in numerous scenarios. The vision of broad AI will further aggravate the computational predicament. Building efficient computing systems for such AI workloads requires innovative reengineering of materials, architecture, and software. The first category of solutions for AI-specific computing systems stems from the statistical and error-tolerant nature of deep learning. Such solutions sacrifice numerical precision for computational performance, yet generally achieve similar or even equivalent classification accuracy to full-precision implementations [118][119][120][121]. We will witness a continuous decline in the precision demands of DNN training and inference in the coming decade. This trend is driven by the constant renovation of AI-specific digital hardware and matching algorithms, which will result in significant improvement in the performance of AI hardware.
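The precision-for-performance trade described above can be illustrated with symmetric linear quantization to 8-bit integers. This is one common reduced-precision scheme chosen here as an assumption; the works cited above may use different number formats, and the function names are hypothetical.

```python
def quantize(values, bits=8):
    """Map floats to signed integers sharing one scale factor; returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [qi * scale for qi in q]


def dot(a, b):
    """Full-precision dot product."""
    return sum(x * y for x, y in zip(a, b))


def quantized_dot(a, b, bits=8):
    """Dot product using integer multiply-accumulates and a single float rescale."""
    qa, sa = quantize(a, bits)
    qb, sb = quantize(b, bits)
    return dot(qa, qb) * sa * sb


if __name__ == "__main__":
    w = [0.8, -0.3, 0.5, 0.1]
    x = [1.0, 2.0, -1.5, 0.25]
    print("full precision:", dot(w, x))
    print("8-bit:", quantized_dot(w, x))
```

The integer products are cheaper in hardware than floating-point ones, and the result stays close to the full-precision value because each operand is off by at most half a quantization step.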
As is previously discussed, another category of solutions lies in the idea of eliminating the overhead of data transmission between processing units and memory.
We can envision the high demands raised by DNN-based AI in the near future. Quantum computing offers the greatest computing power among almost all existing computing paradigms and thus has the potential to boost high-time-complexity deep learning applications that are intractable for the other computing paradigms.

Quantum Computing.
Quantum computing generates an exponential state space of qubits (quantum bits) by exploiting quantum superposition and entanglement. Computing power scales exponentially with the number of qubits: one additional qubit doubles the computing power. Prototypes of quantum computers have emerged from the labs of hardware vendors such as IBM [122,123]. The next task is to bridge the gap between the technical prototype and real applications. For instance, quantum error correction (QEC) codes are indispensable for fault-tolerant quantum computing. Quantum computers will be a core accelerator of future AI-enabled computing systems. Nevertheless, the cost of building a fault-tolerant quantum computer is currently beyond a reasonable range [124]. Further in-depth investigation is urgently needed.
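The exponential growth of the state space can be checked with simple arithmetic: an n-qubit register is described by 2^n complex amplitudes. The sketch below also estimates the memory a classical machine would need to store such a state vector, assuming 16 bytes per amplitude (two 64-bit floats); that storage format and the function names are assumptions for illustration.

```python
def num_amplitudes(n_qubits):
    """Number of complex amplitudes describing an n-qubit state."""
    return 2 ** n_qubits


def simulation_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory needed to hold the full state vector on a classical machine."""
    return num_amplitudes(n_qubits) * bytes_per_amplitude


if __name__ == "__main__":
    for n in (10, 20, 30, 40, 50):
        gib = simulation_bytes(n) / 2 ** 30
        print(f"{n} qubits: {num_amplitudes(n)} amplitudes, {gib:.6g} GiB")
```

Each additional qubit doubles both numbers, which is why full classical simulation becomes impractical beyond a few dozen qubits.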

4.3.4. Integration of Bits, Neurons, and Qubits.
As mentioned above, a deep-learning-enabled computing system relies on three cornerstones: digital computing (bits), neural computing (neurons), and quantum computing (qubits). Systematic solutions to computational-resource-limited deep learning will require the integration of bits, neurons, and qubits. Bits can provide fundamental data storage and guarantee the robustness of the underlying hardware. However, bits alone can only support programmed tasks for specific narrow purposes.
Integrating neurons with bits generates narrow AI or even broad AI, which can not only distill insightful knowledge from an unimaginably huge amount of data but also assist humans in a collaborative and more humanlike manner. Various science and engineering problems are expected to be resolved with the assistance of AI. The core principle of a neural network is to search for a function in the hypothesis space of the network and thus map a category of samples to a corresponding output label. Owing to the large scale and complexity of science and engineering problems, a typical neural network requires a high capacity to generate a large hypothesis space. A large hypothesis space can contribute to reducing the generalization error. Nevertheless, a large hypothesis space means more degrees of freedom and demands a long time for backpropagation driven by stochastic gradient descent to find an approximation to the optimal solution. The exponentially scaling computing power of qubits matches time overhead of a similar order of magnitude.
Digital hardware such as GPGPUs and FPGAs currently accounts for the mainstream accelerators of DNNs. Time-consuming manual fine-tuning of parallel code is unavoidable to achieve optimal performance for every "DNN model-GPGPU type" pair. As a result, digital-hardware-accelerated DL faces a barrier to efficient and agile programming. Moreover, development toolkits for analog-computing-enabled or quantum-computing-based deep learning will undoubtedly be essential when analog computers or quantum computers are someday handed over to investigators, programmers, and computing resource providers.

Conclusion
In this paper, we investigate typical solutions for resource-limited deep learning and point out the open problems. Existing solutions have achieved success under specific scenarios. However, we expect future breakthroughs in the following two aspects. The first aspect is dedicated hardware. Most existing solutions depend on general-purpose digital hardware. Dedicated hardware, which takes into consideration the unique characteristics of deep learning, is a promising direction to achieve further performance enhancements. The second aspect is the theoretical principles of deep learning. Simplifying the DNN is almost an inevitable method to reduce resource consumption. Nonetheless, such methods currently rely on empirical and iterative tuning. Additionally, the robustness of simplification is not theoretically guaranteed. Clarifying the theoretical principles of deep learning will enable more efficient simplification and guarantee robustness.

Disclosure
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Conflicts of Interest
The authors declare no conflicts of interest.

Authors' Contributions
Chunlei Chen conceptualized the study. Jiangyan Dai and Huihui Zhang were responsible for resources. Chunlei Chen and Peng Zhang prepared the original draft. Huixiang Zhang, Yugen Yi, and Yonghui Zhang reviewed and edited the manuscript.