Network pruning is a useful approach for reducing memory footprint and bandwidth requirements. Pruning approaches originated in the early 1990s as a way to reduce a large trained network to a smaller one without requiring retraining [15]. This made it possible to deploy neural networks in constrained settings such as embedded devices and dedicated electronic components. Pruning eliminates redundant neurons or parameters that have no bearing on the correctness of the output [16]; this situation arises when weights are zero, close to zero, or duplicated. Pruning therefore decreases the computational cost. If pruned networks are retrained, they may escape a prior local minimum and improve accuracy even further. Early network pruning research falls into two classes: sensitivity-computation and penalty-term approaches [17, 18, 64–70].
Recent research has advanced both classes of network pruning as well as combinations of the two, and new pruning strategies continue to emerge. Contemporary pruning procedures may be categorised along several dimensions:
- Structured versus unstructured pruning, depending on whether the pruned network is symmetric or not,
- Neuron versus associative pruning, depending on the type of component pruned, or
- Dynamic versus static pruning.
In static pruning, all pruning stages are conducted offline prior to any inference, whereas dynamic pruning is performed at runtime. While there is some overlap among the classifications, we classify network pruning strategies as dynamic or static in this work. Pruning can be applied at several granularities: element-wise, row-wise, column-wise, filter-wise, and layer-wise. Element-wise pruning generally has the least influence on the model's architecture, but it results in an unstructured design [70–76].
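As a minimal sketch of these two extremes of granularity, assuming NumPy and purely illustrative tensor shapes and thresholds, the following zeroes individual weights (unstructured) versus whole filters (structured):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))  # conv weights: (filters, channels, kH, kW)

# Element-wise (unstructured): zero individual weights below a magnitude threshold.
elem_mask = np.abs(w) >= 0.5
w_unstructured = w * elem_mask

# Filter-wise (structured): zero whole filters with the smallest l1-norms,
# keeping the tensor's regular structure intact.
l1 = np.abs(w).reshape(8, -1).sum(axis=1)
keep = l1 >= np.sort(l1)[2]               # drop the two weakest filters
w_structured = w * keep[:, None, None, None]

print("unstructured sparsity:", 1 - elem_mask.mean())
print("filters kept:", int(keep.sum()), "of", len(keep))
```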
$$\underset{PR}{\text{arg}\,\text{min}}\, L = NT\left(x;W\right) - {NT}_{p}\left(x;{W}_{PR}\right), \quad \text{where } {NT}_{p}\left(x;{W}_{PR}\right) = PR\left(NT\left(x;W\right)\right)$$
Regardless of category, pruning can be expressed mathematically as in the equation above. NT denotes the full neural network taking x as input, consisting of a sequence of layers (e.g., convolutional layers, pooling layers, etc.). NTp denotes the pruned network, whose loss in performance L is measured relative to the unpruned network. Classification accuracy is a common measure of network performance. The pruning function PR(·) produces a new network configuration NTp as well as pruned weights WPR. The impact of PR(·) on NTp is the focus of the following sections, and how WPR is obtained is also considered [77–81].
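As a minimal sketch of this objective, assuming a single linear layer stands in for NT and a hypothetical magnitude_prune function plays the role of PR(·), the performance loss L can be measured as the accuracy gap between the unpruned and pruned models:

```python
import numpy as np

def evaluate(weights, x, y):
    """Classification accuracy of a linear model (stand-in for NT(x; W))."""
    return np.mean(np.argmax(x @ weights, axis=1) == y)

def magnitude_prune(weights, threshold=0.05):
    """A simple pruning function PR(.): zero small-magnitude weights, yielding W_PR."""
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))             # input batch x
w = rng.normal(scale=0.2, size=(32, 10))   # "trained" weights W (random, for illustration)
y = np.argmax(x @ w, axis=1)               # labels consistent with W

w_pr = magnitude_prune(w)                  # NT_p(x; W_PR) = PR(NT(x; W))
L = evaluate(w, x, y) - evaluate(w_pr, x, y)
print(f"performance loss L = {L:.4f}")
```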
Static Pruning
Static pruning is a network optimization approach that eliminates neurons from the network after training and before inference; no further trimming is done during inference. Static pruning usually consists of three steps: 1) choosing which parameters to prune, 2) deciding how to prune the neurons, and 3) fine-tuning or retraining if necessary. Retraining the pruned network to attain accuracy equal to that of the unpruned network may improve performance, but it can take substantial offline computing time and energy [19].
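A minimal sketch of this three-step pipeline, assuming PyTorch and an illustrative 50% per-layer magnitude criterion (the model and training data here are random placeholders, not a specific published method):

```python
import torch
import torch.nn as nn

# A small model stands in for the trained, unpruned network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# Step 1: choose what to prune -- here, the 50% smallest-magnitude weights per layer.
masks = {}
for name, p in model.named_parameters():
    if p.dim() > 1:                                   # prune weight matrices, keep biases
        threshold = p.abs().flatten().quantile(0.5)
        masks[name] = (p.abs() >= threshold).float()

# Step 2: prune by zeroing the selected weights.
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])

# Step 3: fine-tune, re-applying the masks so pruned weights stay at zero.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
for _ in range(10):
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
```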
Dynamic Pruning
Dynamic pruning decides at runtime which layers, channels, or neurons will not participate in further computation. By taking advantage of variation in the input data, dynamic pruning can overcome the limits of static pruning, potentially reducing computation, bandwidth, and power consumption. In most cases, dynamic pruning does not involve runtime fine-tuning or retraining. The decision method that determines what to prune is the most critical consideration [20].
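A minimal sketch of one possible runtime decision component, assuming PyTorch; the gating head, keep ratio, and layer shapes are illustrative assumptions rather than a specific published method:

```python
import torch
import torch.nn as nn

class DynamicChannelGate(nn.Module):
    """Per-input channel gating: a small decision head picks, for each sample,
    which output channels of a conv layer to keep; the rest are zeroed."""
    def __init__(self, in_ch, out_ch, keep_ratio=0.5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Linear(in_ch, out_ch)   # the runtime decision component
        self.k = max(1, int(out_ch * keep_ratio))

    def forward(self, x):
        # Decide at runtime, from the input itself, which channels matter.
        scores = self.gate(x.mean(dim=(2, 3)))           # (batch, out_ch)
        topk = scores.topk(self.k, dim=1).indices
        mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)
        # Compute the conv, then suppress the pruned channels for this input.
        return self.conv(x) * mask[:, :, None, None]

layer = DynamicChannelGate(16, 32)
out = layer(torch.randn(4, 16, 8, 8))  # different channels active per sample
```

In a real implementation the gated-off channels would be skipped rather than computed and zeroed; the sketch only illustrates the decision mechanism.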
Table 1
Comprehensive Analysis of Network Pruning
Approach | Pruning Strategy | Description and Inference Impact |
Static Pruning | Magnitude-based Pruning | It has been postulated and widely accepted that trained weights with large magnitudes are more important than those with small magnitudes; this observation is the cornerstone of magnitude-based approaches. Magnitude-based pruning strategies aim to identify and eliminate weights or features that are unnecessary at inference time. Unused values can be trimmed in both the kernels and the activation maps. The most intuitive magnitude-based strategy is to prune all zero-valued weights, or all weights below an absolute-value threshold [21]. |
Filter-wise Pruning | It employs the l1-norm to eliminate filters that have no impact on classification accuracy. On the CIFAR-10 dataset, pruning whole filters and their corresponding feature maps lowered inference costs by 34% for VGG-16 and 38% for ResNet-110, with accuracy improvements of 0.75% and 0.02%, respectively [22]. |
Penalty-based Pruning | The purpose of penalty-based pruning is to modify the error function or add constraints to the training process in the form of penalty terms. Under the penalty, some weights are driven to zero or near zero during training; these values are then trimmed (see the sketch after this table) [23]. |
Element-wise Pruning | Element-by-element pruning can result in unstructured network organisations. The resulting sparse weight matrices are difficult to process efficiently on conventional instruction-set processors, and without specialised hardware support they are often difficult to compress or accelerate. Group LASSO compensates for these shortcomings with a structured pruning strategy that eliminates whole groups of neurons while preserving the network topology [24]. |
Group-wise Brain Damage | It likewise applies the group LASSO constraint, but only to filters; this creates structured sparsity and simulates brain damage. On the VGG network, it achieved a 2× speedup with a 0.7% ILSVRC-2012 accuracy loss [25]. |
Network Slimming | It applies LASSO to the batch-normalisation (BN) scaling factors. BN normalises activations using statistical parameters collected during the training phase. Because slimming reuses the existing BN scale parameters, it introduces no extra parameters or forward-pass overhead. Setting a BN scaling factor to zero effectively prunes the corresponding channel. On ILSVRC-2012, it achieved an 82.5% size reduction and a 30.4% computation reduction with VGG, without losing accuracy [26]. |
Sparse Structure Selection | It is a generalised network-slimming approach. It prunes neurons, groups, and residual blocks by applying LASSO to sparse scaling factors. Using an improved gradient method, Accelerated Proximal Gradient (APG), the approach achieves a 4× speed-up on VGG-16 with a 3.93% ILSVRC-2012 top-1 accuracy loss, without fine-tuning [27]. |
Pruning combined with Tuning or Retraining | Deep Compression | It prunes, with a static technique, connections that do not contribute to classification accuracy: weights with small values are eliminated in addition to feature-map trimming. The network is then retrained to recover accuracy. This process, repeated three times, resulted in a 9× to 13× reduction in total parameters with no loss of accuracy; the majority of the eliminated parameters came from the fully connected layers (FCLs) [28]. |
Recoverable Pruning | In most approaches, elements that have been pruned cannot be recovered, so network capacity may be reduced. Recovering network capability requires extensive retraining; retraining the network for deep compression took millions of iterations. Many techniques therefore use recoverable pruning algorithms to circumvent this flaw: the trimmed components may still play a role in later training phases, adapting to the reduced network [29]. |
Soft Filter Pruning | It extends recoverable pruning to the filter dimension. SFP achieves structured compression results with the added benefit of shorter, more predictable inference times. Moreover, SFP can be employed on networks that are difficult to compress, with a 29.8% speedup on ResNet-50 and a 1.54% ILSVRC-2012 top-1 accuracy loss. In comparison to Guo's recoverable-weight approach, SFP exploits the structure of the filter to produce inference speedups closer to the theoretical values on general-purpose hardware [29]. |
AutoPruner | It integrates the pruning and fine-tuning stages of the three-stage pipeline into a single training-friendly layer. This layer gradually prunes the network during training, yielding a less complex network. With a 2.39% ILSVRC-2012 top-1 accuracy loss, AutoPruner pruned 73.59% of the compute operations of VGG-16; on ResNet-50 it achieved a 65.80% reduction in compute operations with a 3.10% accuracy loss [30]. |
Dynamic Pruning | Conditional Computing | Conditional computing activates only the portion of a network that is relevant to a given input, rather than the full network. Excluding the inactive neurons from the computation is what constitutes pruning here: they do not contribute to the final result, minimising the number of calculations required. Conditional computing affects both training and inference [31]. |
Reinforcement Learning | Adaptive networks aim to speed up inference by conditionally detecting early exits. Thresholds can be used to trade off network computation against accuracy. Adaptive networks use multiple intermediate classifiers to allow an early exit. One form of adaptive network is the cascade network, composed of several serial networks with their own output layers rather than per-layer outputs. Cascade networks offer the benefit of early exit, since not all output layers need to be calculated. If the early accuracy of a cascade network is insufficient, inference can be forwarded to a cloud device [32]. |
Differentiable Adaptive Networks | Because most of the aforementioned decision components are non-differentiable, RL is used to train them. A variety of differentiable approaches have been developed to reduce this training complexity [33]. |
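As referenced in the penalty-based row above, the following is a minimal sketch of penalty-based pruning, assuming PyTorch; the l1 penalty strength and trimming threshold are illustrative values, and the model and data are random placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(128, 64), torch.randint(0, 10, (128,))
l1_lambda = 1e-3  # penalty strength (illustrative)

# Training with an added l1 penalty term drives unimportant weights toward zero.
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss = loss + l1_lambda * model.weight.abs().sum()  # the penalty term
    loss.backward()
    opt.step()

# Afterwards, weights at or near zero are trimmed away.
with torch.no_grad():
    model.weight.mul_((model.weight.abs() >= 1e-2).float())
print("weight sparsity:", (model.weight == 0).float().mean().item())
```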
Pruning methods vary widely and are difficult to compare. In [34], a single benchmark system is proposed for comparing pruning performance. The value of the pre-trained weights is a point of contention. According to [35], the pruned model can be trained from scratch using a random weight initialisation, which implies that the pruned architecture itself is critical to success; in light of this finding, pruning algorithms can be viewed as a form of neural architecture search (NAS). Because the weight values can be retrained, the authors concluded that they are not important on their own. However, the lottery ticket hypothesis [36] attained equal accuracy only when the weight initialisation was identical to that of the unpruned model. This disagreement was resolved in [37] by demonstrating that what truly matters is the form of pruning: unstructured pruning, in particular, can only be fine-tuned to restore accuracy, whereas structured pruning can be trained from scratch. They also investigated the effectiveness of dropout and l0 regularisation and found that simple magnitude-based pruning performed better. They developed a magnitude-based pruning technique and demonstrated that the pruned ResNet-50 outperformed the state of the art (SOTA) at the same computational complexity.