Pruning by Explaining: A Novel Criterion for Deep Neural Network Pruning

The success of convolutional neural networks (CNNs) in various applications is accompanied by a significant increase in computation and parameter storage costs. Recent efforts to reduce these overheads involve pruning and compressing the weights of various layers while aiming not to sacrifice performance. In this paper, we propose a novel criterion for CNN pruning inspired by neural network interpretability: the most relevant elements, i.e., weights or filters, are automatically found using their relevance scores in the sense of explainable AI (XAI). We thereby link, for the first time, the two previously disconnected lines of interpretability and model compression research. We show in particular that our proposed method can efficiently prune transfer-learned CNN models, where networks pre-trained on large corpora are adapted to specialized tasks. To this end, the method is evaluated on a broad range of computer vision datasets. Notably, our novel criterion is not only competitive with or better than state-of-the-art pruning criteria when successive retraining is performed, but also clearly outperforms these previous criteria in the common application setting where the data for the target task are very scarce and no retraining is possible. Our method can iteratively compress the model while maintaining or even improving accuracy. At the same time, it has a computational cost on the order of gradient computation and is comparatively simple to apply, without the need to tune hyperparameters for pruning.


Introduction
Deep CNNs have become an indispensable tool for a wide range of applications [1], such as image classification, speech recognition [2], natural language processing [3], chemistry [4], neuroscience [5], and medicine [6], and are even applied to playing games such as Go, poker or Super Smash Bros. [7]. They have achieved high predictive performance, at times even outperforming humans. Furthermore, in specialized domains where limited training data is available, e.g., due to the cost and difficulty of data generation (medical imaging from fMRI, EEG, PET etc.), transfer learning can improve CNN performance by extracting knowledge from the source tasks and applying it to a target task which has limited training data [8].
However, the high predictive performance of CNNs often comes at the expense of high storage and computational costs, which are directly related to the energy expenditure demanded by the elaborate architecture of the fine-tuned network. These deep architectures are composed of millions of parameters to be trained, leading to overparameterization (i.e., having more parameters than training samples) of the model [9]. The run-times are typically dominated by the evaluation of convolutional layers, while dense layers are cheap but memory-heavy [10]. For instance, the VGG-16 model [11] has approximately 138 million parameters, taking up more than 500MB in storage space, and needs 15.5 billion floating-point operations (FLOPs) to classify a single image. Note that overparameterization is helpful for efficient and successful training of neural networks; however, once the trained and well-generalizing network structure is established, pruning can help to reduce redundancy while still maintaining good performance [12].
Reducing a model's storage requirements and computational cost becomes critical for broader applicability, e.g., in embedded systems, autonomous agents, mobile devices, or edge devices [13,14,15,16]. Neural network pruning has a decades-long history, with interest from both academia and industry [12,14], and aims to detect and eliminate the subset of network elements (i.e., weights or filters, including connections and neurons) that is less important w.r.t. the network's intended task. For network pruning, it is crucial to decide how to quantify the relevance of the parameters in their current state in order to select candidates for deletion. To address this issue, previous research has proposed specific criteria based on, for instance, Taylor expansion, weights, or gradients to reduce the complexity and computational cost of the network; related works are introduced in Section 2.
From the practical point of view, the full capacity (in terms of weights and filters) of an overparameterized model may not be required, e.g., when 1. parts of the model lie dormant after training (i.e., are permanently "switched off"), 2. a user is not interested in the model's full array of possible outputs, which is a common scenario in transfer learning (e.g. the user only has use for 2 out of 10 available network outputs), or 3. a user lacks data and resources for fine-tuning and running the overparameterized model.
In these scenarios the redundant parts of the model will still occupy space on disk and in system memory, and information will still be propagated through the redundant parts of the model, consuming energy and increasing runtime. Thus, criteria able to stably and significantly reduce the computational complexity of deep neural networks broadly across applications would be highly versatile for all practitioners.
In this paper, we propose a novel pruning framework based on layer-wise relevance propagation (LRP) [17]. LRP was originally developed as an explanation method that assigns importance scores, so-called relevances, to the different input dimensions of a neural network. These scores reflect the contribution of an input dimension to the model's decision, and the method has been applied to different fields of computer vision (e.g., [18,19]). The relevance is backpropagated from the output to the input and thereby assigned to each element of the deep model. Since relevance scores are computed for every layer and neuron from the model output to the input, they essentially reflect the importance of every single element of a model and its contribution to the information flow through the network, which makes them a natural candidate for a pruning criterion. The LRP criterion can be motivated theoretically through the concept of deep Taylor decomposition (cf. [20]). Moreover, LRP is scalable and easy to apply, and has been implemented in software frameworks such as [21]. Furthermore, like backpropagation, it has linear computational cost.
We systematically evaluate the compression efficacy of the LRP criterion compared to common pruning criteria on two different scenarios.

Scenario 1: We focus on pruning of pre-trained CNNs including subsequent fine-tuning. This is the usual setting in CNN pruning and requires a sufficiently large amount of data and computational power.

Scenario 2: We focus on another scenario, in which a model was pre-trained and needs to be transferred to a related problem, but the data available for the new task is too scarce for proper fine-tuning and/or the time, computational power or energy available is constrained. Such transfer learning with restrictions is common in mobile or embedded applications.

Our experimental results on various benchmark datasets and two different popular CNN architectures show that the LRP criterion for pruning is more scalable and efficient and leads to better performance than existing criteria regardless of data types and model architectures if retraining is performed (Scenario 1). In particular, if retraining after pruning is prohibited due to external constraints, the LRP criterion clearly outperforms previous criteria on all datasets (Scenario 2). Finally, we would like to note that our proposed pruning framework is not limited to LRP, but can also be used with other explanation techniques.
The rest of this paper is organized as follows: Section 2 summarizes related works on network compression and introduces the typical criteria for network pruning. Section 3 describes the framework and details of our approach. The experimental results are illustrated and discussed in Section 4. Section 5 gives conclusions and an outlook to future work.

Related Work
Considerable research effort has been devoted to network compression and acceleration. For instance, network quantization methods have been proposed for storage space compression by decreasing the number of possible and unique values for the parameters [13,22,23]. Tensor decomposition approaches decompose network matrices into several smaller ones to estimate the informative parameters of deep CNNs with low-rank approximation/factorization [24,25,26,27]. More recently, [28] also proposed a framework of architecture distillation based on layer-wise replacement, called LightweightNet, for saving memory and time. Algorithms for designing efficient models focus more on acceleration than on compression by optimizing convolution operations or architectures directly [29,30].
Network pruning approaches remove redundant or irrelevant elements (i.e., nodes, filters, or layers) from the model which are not critical for performance [12,14,31]. Network pruning is robust to various settings and easily and cheaply gives reasonable compression rates while not (or only minimally) hurting the model accuracy. It can also support both training from scratch and transfer learning from pre-trained models. Early works have shown that network pruning is effective in reducing network complexity while simultaneously addressing over-fitting problems. Current network pruning techniques make weights or channels sparse by removing non-informative connections and require an appropriate criterion for identifying which elements of the model are not relevant for solving a problem. Thus, it is crucial to decide how to quantify the relevance of the parameters (i.e., weights or channels) from the current sample in stochastic learning, so that they can be deleted without sacrificing predictive performance. In previous studies, pruning criteria have been proposed based on 1) Taylor expansion/derivatives, 2) gradients, 3) weight magnitudes, and 4) other criteria, as described below.
Taylor expansion: Early approaches towards neural network pruning, optimal brain damage [12] and optimal brain surgeon [32], leveraged a second-order Taylor expansion based on the Hessian matrix of the loss function to select parameters for deletion. However, computing the inverse of the Hessian entails high computational effort. The works of [33,34] used a first-order Taylor expansion as a criterion to approximate the change of the loss in the objective function caused by pruning away network elements.
Gradient: [35] proposed a sparsified backpropagation approach for neural network training that uses the magnitude of the gradient to find essential and non-essential features in MLP and LSTM models, which can be used as a pruning criterion. [36] proposed a hierarchical global pruning strategy that calculates the mean gradient of the feature maps in each layer and is applied across layers with similar sensitivity.
Weight: A recent trend is to prune redundant, non-informative weights in pre-trained CNN models. [37] and [38] proposed pruning the weights whose magnitude is below a certain threshold and subsequently fine-tuning with $\ell_1$-norm regularization. This pruning strategy has been used on fully-connected layers and introduced sparse connections with BLAS libraries, supporting specialized hardware to achieve its acceleration. In the same context, Structured Sparsity Learning (SSL) added group sparsity regularization to penalize unimportant parameters by removing some weights [39]. [40] proposed a one-shot channel pruning method using the $\ell_1$-norm of weights for filter selection, under the assumption that channels with smaller weights always produce weaker activations. Recently, channel pruning alternatively used LASSO regression based channel selection and feature map reconstruction to prune filters [41]. [42] performed structured pruning in convolutional layers by considering strided sparsity of feature maps and kernels to avoid the need for custom hardware, and used particle filters to decide the importance of connections and paths. In contrast to previous pruning studies for deep deterministic models, [43] proposed a pruning approach for deep probabilistic models that uses a mask of the weights. [44] proposed a fusion approach that combines weight-based channel pruning with network quantization. [45] proposed a weight-based network pruning approach using a modified $\ell_{1/2}$ penalty to increase the sparsity of the pre-trained models. More recently, [46] proposed an evolutionary paradigm for weight-based pruning and gradient-based growing to reduce the network heuristically.
Other criteria: [47] proposed the Neuron Importance Score Propagation (NISP) algorithm to propagate the importance scores of the final responses before the softmax classification layer through the network. In contrast to our proposed metric, the method is based on a layer-independent pruning process which does not consider global importance in the network. [48,49] proposed ThiNet, a data-driven statistical channel pruning technique based on statistics computed from the next layer.
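To make the difference between the first three families concrete, the following is a minimal sketch in plain Python of how a weight-, gradient- and Taylor-based score could be computed per filter. The function names, the toy numbers and the exact form of each score (l1 norm of weights, magnitude of the mean activation gradient, magnitude of the mean activation-gradient product) are illustrative assumptions and not the precise definitions used in the cited works.

```python
import math

def l2_normalize(scores):
    """Rescale one layer's scores by their l2 norm (layer-wise renormalization)."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm if norm > 0 else 0.0 for s in scores]

def weight_criterion(filters):
    """Score each filter by the l1 norm of its weights."""
    return [sum(abs(w) for w in f) for f in filters]

def gradient_criterion(act_grads):
    """Score each filter by the magnitude of its mean activation gradient."""
    return [abs(sum(g) / len(g)) for g in act_grads]

def taylor_criterion(acts, act_grads):
    """First-order Taylor score: |mean(activation * gradient)| per filter."""
    return [abs(sum(a * g for a, g in zip(a_f, g_f)) / len(a_f))
            for a_f, g_f in zip(acts, act_grads)]

# Toy layer with two filters; all numbers are made up for illustration.
filters = [[0.5, -0.5], [0.1, 0.2]]   # weights per filter
acts = [[1.0, 2.0], [0.5, 0.5]]       # activations per filter
grads = [[0.1, -0.1], [0.4, 0.4]]     # loss gradients w.r.t. activations

w_scores = l2_normalize(weight_criterion(filters))
g_scores = l2_normalize(gradient_criterion(grads))
t_scores = l2_normalize(taylor_criterion(acts, grads))

# Index of the filter each criterion would prune first (lowest score).
first_pruned = {name: s.index(min(s))
                for name, s in [("weight", w_scores),
                                ("gradient", g_scores),
                                ("taylor", t_scores)]}
```

In this toy layer the weight criterion would remove the second filter first, whereas the gradient and Taylor criteria would remove the first one, which illustrates that the choice of criterion alone can change which elements are pruned.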

LRP-Based Network Pruning
A feedforward DNN consists of neurons arranged in a sequence of multiple layers, where each neuron receives its input from the previous layer and propagates its output to every neuron in the next layer, using a non-linear mapping. Network pruning aims to sparsify the network by eliminating weights or filters that are non-informative (according to a certain criterion). We specifically focus our experiments on transfer learning, where the parameters of a network pre-trained on a source domain are subsequently fine-tuned on a target domain, i.e., the final data or prediction task [8]. Here, the pruning procedure can be described as follows:

Network Pruning:
1. Given a pre-trained model in the target domain
2. Define a pruning criterion
3. Repeatedly prune the network as follows:
   i. For each layer,
      a. For each element (weight/filter), evaluate the importance according to the pruning criterion (compute magnitudes)
      b. optional: Globally scale the magnitudes with regularization (e.g. $\ell_p$-norm)
   ii. Sort the magnitudes for all the layers throughout the network
   iii. Prune the least important elements and their inputs and outputs
   iv. optional: Further fine-tune to compensate performance degradation
4. Stop pruning if the model is reduced to a desired model size or performance
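The procedure above can be sketched as a short, framework-independent loop. This is only an illustration under stated assumptions: `prune_iteratively`, `scores_fn` and `fine_tune` are hypothetical names, network elements are represented by plain identifiers, and the stopping rule is a target element count rather than a performance threshold.

```python
def prune_iteratively(scores_fn, elements, prune_fraction, target_size, fine_tune=None):
    """Repeatedly remove the lowest-scoring elements until target_size remain.

    scores_fn(elements) returns a dict mapping each element to its importance;
    it is a hypothetical stand-in for any pruning criterion.
    """
    elements = list(elements)
    # Number of elements removed per step, fixed w.r.t. the original model size.
    step = max(1, int(len(elements) * prune_fraction))
    while len(elements) > target_size:
        scores = scores_fn(elements)                         # evaluate importance
        ranked = sorted(elements, key=lambda e: scores[e])   # sort across all layers
        n_remove = min(step, len(elements) - target_size)
        removed = set(ranked[:n_remove])                     # least important elements
        elements = [e for e in elements if e not in removed]
        if fine_tune is not None:                            # optional fine-tuning
            fine_tune(elements)
    return elements

# Toy usage: four filters identified by (layer, index) with made-up importances.
toy_importance = {("conv1", 0): 0.9, ("conv1", 1): 0.1,
                  ("conv2", 0): 0.4, ("conv2", 1): 0.05}
kept = prune_iteratively(lambda els: {e: toy_importance[e] for e in els},
                         toy_importance, prune_fraction=0.25, target_size=2)
```

Because the scores are recomputed in every iteration, a criterion that depends on the current state of the network (such as gradients or relevances) is re-evaluated after each pruning step.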
Even though most approaches use an identical process, choosing a suitable pruning criterion to quantify the importance of model parameters for deletion while minimizing performance drop (Step 3) is of critical importance, governing the success of the approach.

Layer-wise Relevance Propagation
In this paper, we propose a novel criterion for pruning neural network elements: the relevance quantity computed with LRP [17]. LRP decomposes a classification decision into contributions, called "relevances", of each network element to the overall classification score. When computed for the input dimensions of a CNN and visualized as a heatmap, these relevances highlight parts of the input that are important for the classification decision. LRP thus originally served as a tool to interpret non-linear learning machines and has been applied as such in various fields, amongst others general image recognition [18], medical imaging [19] and natural language processing [50]. However, the direct link between the relevances and the classifier output makes LRP not only attractive for explaining models, but also a natural pruning criterion (see Section 4.1).
The main characteristic of LRP is a backward pass through the network during which the network output is redistributed to all elements of the network in a layer-by-layer fashion. This backward pass is structurally similar to gradient backpropagation and therefore has a similar runtime. The redistribution is based on a conservation principle such that the relevances can immediately be interpreted as the contribution that an element makes to the network output, hence establishing a direct connection to the network output and thus its predictive performance. Therefore, as a pruning criterion, the method is efficient (similar runtime to gradient backpropagation) and easily scalable to generic network structures. Independent of the type of neural network layer (pooling, fully-connected, or convolutional), LRP allows quantifying the importance of elements throughout the network [40].

LRP-based Pruning
The procedure of LRP-based pruning is summarized in Figure 1. In the first phase, a standard forward pass is performed by the network and the activations at each layer are collected. In the second phase, the score obtained at the output of the network $f(x)$ is propagated backwards through the network according to LRP propagation rules [17]. In the third phase, the current model is pruned by eliminating the irrelevant neurons/filters and is further fine-tuned.
LRP is based on a layer-wise conservation principle that allows the propagated quantity (e.g., relevance for a predicted class) to be preserved between neurons of two adjacent layers. Let $R^{(l)}_i$ be the relevance of neuron $i$ at layer $l$ and $R^{(l+1)}_j$ be the relevance of neuron $j$ at the next layer $l+1$. Stricter definitions of conservation that involve only subsets of neurons can further impose that relevance is locally redistributed in the lower layers, and we define $R^{(l)}_{i \leftarrow j}$ as the share of $R^{(l+1)}_j$ that is redistributed to neuron $i$ in the lower layer. The conservation property always satisfies

$$\sum_i R^{(l)}_{i \leftarrow j} = R^{(l+1)}_j, \tag{1}$$

where the sum runs over all neurons $i$ of the (during inference) preceding layer $l$. When using relevance as a pruning criterion, this property helps to preserve its quantity layer-by-layer, regardless of hidden layer size and the number of iteratively pruned neurons for each layer.
At each layer $l$, we can extract node $i$'s global importance as its attributed relevance $R^{(l)}_i$. In this paper, we specifically adopt relevance quantities computed with the LRP-$\alpha_1\beta_0$-rule as pruning criterion. The LRP-$\alpha\beta$-rule was developed with feedforward DNNs with ReLU activations in mind and assumes positive (pre-softmax) logit activations $f_{\text{logit}}(x) > 0$ for decomposition. The rule has been shown to work well in practice in such a setting [51]. The propagation rule performs two separate relevance propagation steps per layer: one exclusively considering the activatory parts of the forward-propagated quantities (i.e., all $a^{(l)}_i w_{ij} > 0$) and another only processing the inhibitory parts ($a^{(l)}_i w_{ij} < 0$), which are subsequently merged in a sum with components weighted by $\alpha$ and $\beta$ (s.t. $\alpha + \beta = 1$), respectively.
By selecting $\alpha = 1$, the propagation rule simplifies to

$$R^{(l)}_i = \sum_j \frac{a^{(l)}_i w^{+}_{ij}}{\sum_{i'} a^{(l)}_{i'} w^{+}_{i'j}} R^{(l+1)}_j, \tag{2}$$

where $R^{(l)}_i$ denotes the relevance attributed to the $i$-th neuron at layer $l$, as an aggregation of downward-propagated relevance messages $R^{(l)}_{i \leftarrow j}$. The terms $w^{+}_{ij}$ and $a^{(l)}_i$ indicate the positive part of the weight filter and the (strictly positive) activation of the $i$-th neuron at layer $l$, respectively. Note that a choice of $\alpha = 1$ only decomposes w.r.t. the parts of the inference signal supporting the model decision for the class of interest.
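The rule with $\alpha = 1$ and $\beta = 0$ can be illustrated for a single fully-connected layer. The sketch below is a minimal plain-Python version under simplifying assumptions: bias terms are ignored, activations are non-negative, and the function name `lrp_alpha1_beta0` as well as the toy numbers are hypothetical. Each upper-layer neuron $j$ distributes its relevance to the lower-layer neurons in proportion to their positive contributions $a_i w^{+}_{ij}$.

```python
def lrp_alpha1_beta0(activations, weights, relevance_out):
    """Redistribute relevance from layer l+1 to layer l with alpha=1, beta=0.

    activations:   a_i at layer l (non-negative, e.g. after a ReLU)
    weights:       weights[i][j] connects neuron i (layer l) to neuron j (layer l+1)
    relevance_out: R_j at layer l+1
    """
    n_in, n_out = len(activations), len(relevance_out)
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # Positive contributions a_i * w_ij^+ flowing into neuron j.
        z = [activations[i] * max(weights[i][j], 0.0) for i in range(n_in)]
        z_sum = sum(z)
        if z_sum == 0.0:
            continue  # neuron j received no positive input; nothing to redistribute
        for i in range(n_in):
            # Relevance message from neuron j to neuron i.
            relevance_in[i] += z[i] / z_sum * relevance_out[j]
    return relevance_in

# Toy layer: 3 input neurons, 2 output neurons (made-up numbers).
a = [1.0, 2.0, 0.0]
W = [[1.0, -1.0],
     [0.5, 2.0],
     [3.0, 1.0]]
R_out = [1.0, 2.0]
R_in = lrp_alpha1_beta0(a, W, R_out)   # [0.5, 2.5, 0.0]
```

In this example the total relevance is 3.0 both before and after the redistribution, which reflects the conservation property, and the inactive third neuron receives no relevance at all, which would make it the first candidate for pruning.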
Note that Equation (2) is locally conservative, i.e., no quantity of relevance gets lost or injected during the distribution of $R^{(l+1)}_j$, where each term of the sum corresponds to a relevance message $R^{(l)}_{i \leftarrow j}$. For this reason, LRP has the following technical advantages over other pruning techniques such as gradient-based or activation-based methods:
1. Localized relevance conservation implicitly ensures a layer-wise regularized global redistribution of importances from each network element.
2. By summing relevance within each (convolutional) filter channel, the LRP-based criterion is directly applicable as a measure of total relevance per node/filter, without requiring a post-hoc layer-wise renormalization, e.g., via the $\ell_p$-norm.
3. The use of relevance scores is not restricted to a global application of pruning but can easily be applied to locally and (neuron- or filter-)group-wise constrained pruning without regularization.

Different strategies for selecting (sub-)parts of the model might still be considered, e.g., applying different weightings/priorities for pruning different parts of the model: should the aim of pruning be the reduction of the FLOPs required during inference, one would prefer to focus primarily on pruning elements of the convolutional layers. In case the aim is a reduction of the memory requirement, pruning should focus on the fully-connected layers instead.
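The second point, summing relevance within each filter channel and ranking the filters across all layers without renormalization, can be sketched as follows. The data layout (one 2D relevance map per filter, for a single sample), the helper names and the numbers are assumptions made for illustration; in practice the relevances would come from an LRP backward pass and would typically be aggregated over several reference samples.

```python
def filter_relevance(relevance_maps):
    """Sum relevance over all spatial positions of each filter's output map."""
    return [sum(sum(row) for row in fmap) for fmap in relevance_maps]

def least_relevant(layer_scores, n_prune):
    """Return the n_prune (layer, filter index) pairs with the lowest relevance,
    ranked across all layers at once, without per-layer renormalization."""
    flat = [(score, layer, idx)
            for layer, scores in layer_scores.items()
            for idx, score in enumerate(scores)]
    flat.sort()
    return [(layer, idx) for _, layer, idx in flat[:n_prune]]

# Toy relevance maps: two layers, two filters each, 2x2 spatial positions.
relevance = {
    "conv1": [[[0.1, 0.2], [0.0, 0.1]], [[0.0, 0.0], [0.05, 0.0]]],
    "conv2": [[[0.3, 0.1], [0.1, 0.0]], [[0.0, 0.01], [0.0, 0.0]]],
}
scores = {layer: filter_relevance(maps) for layer, maps in relevance.items()}
to_prune = least_relevant(scores, n_prune=2)
```

Here the second filter of each layer carries the least relevance, so both are selected even though they belong to different layers; restricting `layer_scores` to a subset of layers would give the locally constrained variant mentioned in the third point.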
Algorithm 1 provides an overview of the LRP-based pruning process.

Experiments
Algorithm 1 (excerpt; only the following steps are recoverable):
    ...
    if layer is 'conv' then
        Compute filter-wise sum of R
    Stack R in S
    ...
    for all j where S_j < m × pr do
        Remove the j-th weight/filter and its consecutive connections
    ...
    Fine-tuning (optional)
    end while
    Output: pruned model f

In the following, we will first attempt to intuitively illuminate the properties of different pruning criteria, namely weight magnitude, Taylor, gradient and LRP, by means of a series of toy datasets. We then show the effectiveness of the LRP criterion for pruning on current and widely used image recognition benchmark datasets, i.e., the Scene 15 [52], Event 8 [53], Cats and Dogs [54], Oxford Flower 102 [55], Cifar 10 [56], and ILSVRC2012 [57] datasets, and two pre-trained feed-forward deep neural network architectures, AlexNet and VGG-16.
The first scenario focuses specifically on pruning of pre-trained CNNs with subsequent fine-tuning, as is common in pruning research [33]. We compare our method with several state-of-the-art criteria to demonstrate the effectiveness of LRP as a pruning criterion in CNNs. In the second scenario, we test whether the proposed pruning criterion also works well if only a very limited number of samples is available for pruning the model. This is relevant for devices with limited computational power, energy and storage, such as mobile devices or embedded applications.

Pruning Toy Models
Firstly, we systematically compare the properties and effectiveness of the different pruning criteria on toy datasets. In order to evaluate the criteria under different data distributions, we tested four pruning criteria on three toy datasets ("moon", "circle" and "multi-class" (k = 4)), generated using the respective functions from the Scikit-Learn machine learning toolkit for Python [58] and shown in Figure 2. We generated a 2D dataset which consists of 1000 training samples and 5 test samples per class. We constructed the model as follows, During pruning, we pruned the 1000 of 3000 neurons that have the least relevance for prediction according to each criterion. After pruning, we observed the change in the decision boundaries and re-evaluated the classification accuracy using the original training samples across criteria. On the first toy example ("moon" dataset, first row in Figure 2), the original model accuracy is 99.9%. After pruning, the accuracies are 99.6% (weight), 74.5% (Taylor), 76.9% (gradient), and 99.85% (LRP), respectively. On the second example ("circle" dataset, second row), the original accuracy is 100.0%. After pruning, the accuracies are 97.1% (weight), 97.25% (Taylor), 75.05% (gradient), and 100.0% (LRP), respectively. On the third example ("multi-class" dataset, third row), the original model accuracy is 94.95%. After pruning, the accuracies are 91.0% (weight), 85.15% (Taylor), 84.93% (gradient), and 91.3% (LRP), respectively (see Table 1). These results show that among the considered pruning criteria, pruning with the relevance criterion based on LRP best preserves the effective core of the example models, and thus the prediction performance, in every toy setting.
In contrast to the other, heuristics-based pruning criteria, LRP directly relates relevance to the classification output and thus allows the elements that are unimportant w.r.t. classification to be removed safely. Furthermore, Figure 2 shows that when pruning 1000 out of 3000 neurons, LRP-based pruning results in only minimal change in the decision boundary compared to the other criteria. One can also clearly see that, compared to the weight- and LRP-based criteria, models pruned by gradient-based criteria misclassify a large part of the samples.

Pruning Deep Image Classifiers for Large-scale Benchmark Data
The performance of the pruning criteria is evaluated on the CNNs VGG-16 and AlexNet pre-trained on ILSVRC2012, which are popular in model compression research [59]. VGG-16 consists of 13 convolutional layers with 4224 filters and 3 fully-connected layers, and AlexNet contains 5 convolutional layers with 1552 filters and 3 fully-connected layers. In both models, the dense layers contain 8192 + N neurons, where N is the number of classes. In terms of model complexity, the VGG-16 and AlexNet models pre-trained on ImageNet originally consist of 138.36/60.97 million parameters and require 154.7/7.27 GMAC (as a measure of FLOPs), respectively [60]. Experiments are performed within the PyTorch and torchvision frameworks [61] on an Intel(R) Xeon(R) CPU E5-2660 2.20GHz and an NVIDIA Tesla P100 with 12GB for GPU processing. We evaluated the criteria on six public datasets (Scene 15 [52], Event 8 [53], Cats and Dogs [54], Oxford Flower 102 [55], Cifar 10 [56], and ILSVRC2012 [57]). For more detail on the datasets and the preprocessing, see Appendix A. We fine-tuned the model for 200 epochs with a constant learning rate of 0.001 and a batch size of 20. We used the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9. In addition, we apply dropout to the fully-connected layers with a probability of 0.5. Fine-tuning and pruning are performed on the training set, while results are evaluated on each test dataset. Throughout the experiments, we iteratively prune 5% of all the filters in the network by eliminating neurons including their input and output connections. In Scenario 1, we subsequently fine-tune and re-evaluate the model to account for dependencies across parameters and to regain performance, as is common.

[Figure: panels "LRP vs. Weight", "LRP vs. Gradient", "LRP vs. Taylor"; axes: Test Accuracy, Training Loss]

Scenario 1: Pruning with Fine-tuning
In the first scenario, we retrain the model after each iteration of pruning in order to regain lost performance. We then evaluate the performance of the different pruning criteria after each pruning-retraining step.
That is, we quantify the importance of each filter by the magnitude of the respective criterion and iteratively prune the 5% of filters (w.r.t. the original number of filters in the model) rated least important in each pruning step. Then, we compute and record the training loss, test accuracy, number of remaining parameters and total estimated FLOPs. We assume that the least important filters should have only little influence on the prediction and thus incur the lowest performance drop if they are removed from the network.
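Since the pruning rate refers to the original number of filters, the same number of filters is removed in every step rather than a shrinking share of the remaining ones. A minimal sketch of the resulting schedule, using the filter counts stated above (the helper name `pruning_schedule` is hypothetical, and rounding down to whole filters is an assumption):

```python
def pruning_schedule(n_filters, rate=0.05, n_steps=5):
    """Filters remaining after each step when a fixed share of the ORIGINAL
    filter count is removed per step (rounded down to whole filters)."""
    per_step = int(n_filters * rate)
    return [n_filters - per_step * k for k in range(n_steps + 1)]

vgg16 = pruning_schedule(4224)     # 211 filters removed per step
alexnet = pruning_schedule(1552)   # 77 filters removed per step
# vgg16   -> [4224, 4013, 3802, 3591, 3380, 3169]
# alexnet -> [1552, 1475, 1398, 1321, 1244, 1167]
```

Under this reading, a pruning rate of 50% is reached after ten steps for both architectures.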
Figures 3 and A.8 depict the training loss (bottom) and test accuracy (top) as a function of the pruning rate for VGG-16 and AlexNet after fine-tuning, for each dataset and each criterion. At each pruning iteration, we remove 5% of the weights/filters in the entire network based on the magnitude of the different criteria. We observe that LRP achieves higher test accuracies as well as lower training losses compared to the other criteria for the entire range of pruning iterations on every dataset (see Figures 4 and A.7). These results demonstrate that the performance of LRP-based pruning is stable and independent of the chosen dataset. Apart from performance, the need for layer-wise regularization is a critical constraint that hinders the extension of a criterion to different pruning strategies such as local or global pruning. Except for the LRP criterion, all criteria perform substantially worse without $\ell_p$ regularization than with it, and result in unexpected interruptions during the pruning process due to the biased redistribution of importance in the network.
Table 2 shows the predictive performance of the different criteria in terms of training loss, test accuracy, number of remaining parameters and FLOPs. Except for Cifar 10, the highest compression rate (i.e., the lowest number of parameters) could be achieved by the proposed LRP-based criterion (column "#Parameters" in Table 2). However, in terms of FLOPs, the proposed criterion only outperformed the weight criterion, but not the Taylor and gradient criteria (see column "FLOPs" in Table 2). This is due to the fact that a reduction in the number of FLOPs depends on the location where pruning is applied within the network: Figure 5 shows that the LRP and weight criteria focus the pruning on upper layers closer to the model output, whereas the Taylor and gradient criteria focus more on the lower layers.
Throughout the pruning process, usually a gradual decrease in performance can be observed. However, with the Event 8, Oxford Flower 102, and Cifar 10 datasets, pruning leads to an initial performance increase, until a pruning rate of approx. 30% is reached. This behavior has been reported before in the literature and might stem from improvements of the model structure through the elimination of filters related to classes in the source dataset (i.e., ILSVRC2012) that are no longer present in the target dataset [62].
Table A.4 and Figure A.8 similarly show that LRP achieves the highest test accuracy and lowest training loss for nearly all pruning ratios with almost every dataset. Furthermore, as discussed in Section 3, LRP is widely applicable, so the criterion does not need to be adapted to the dataset or model type, which could be a powerful advantage going forward.
Figure 5 shows the number of remaining convolutional filters for each iteration. We observe that, on the one hand, as the pruning rate increases, the convolutional filters in earlier layers that are associated with very generic features, such as edge and blob detectors, tend to be preserved, as opposed to those in later layers, which are associated with abstract, task-specific features. On the other hand, the LRP and weight criteria initially keep the filters in early layers, but later aggressively prune filters near the input which have by then lost functionality as input to later layers, in contrast to the gradient-based criteria, i.e., the gradient and Taylor-based approaches. Although gradient-based criteria also adopt the greedy layer-by-layer approach, we can see that they pruned the less important filters almost uniformly across all the layers due to the re-normalization of the criterion in each iteration. However, this result contrasts with previous gradient-based works [33,35], which have shown that neurons deemed unimportant in earlier layers contribute significantly compared to neurons deemed important in later layers. In contrast to this, LRP can efficiently preserve neurons in the early layers, as long as they serve a purpose, despite iterative global pruning.

Scenario 2: Pruning without Fine-tuning
In this section, we evaluate whether pruning works well if only a (very) limited number of samples is available for quantifying the pruning criteria.To the best of our knowledge, there are no previous studies that show the performance of pruning approaches when acting w.r.t.very small amounts of data.With large amounts of data available (and even though we can expect reasonable performance after pruning), an iterative pruning and fine-tuning procedure of the network can amount to a very time consuming and computationally heavy process.From a practical point of view, this issue becomes a significant problem, e.g. with limited computational resources (mobile devices or in general; consumer-level hardware) and reference data (private photo collections), where capable and effective one-shot pruning approaches are desired and only little leeway (or none at all) for post-pruning fine-tuning strategies is available.
To investigate whether pruning is possible also in this scenario, we performed experiments with a relatively small number of data on the 1) Cats and Dogs and 2) ILVSRC 2012 datasets.On the Cats and Dogs dataset, we only used 10 samples each from the "cat" and "dog" classes to prune the (on ImageNet) pre-trained VGG-16 network with the goal of domain/dataset adaption.The binary classification (i.e."cat" vs. "dog") is a subtask of ImageNet and corresponding output neurons can be identified by its WordNet 1 associations.On the ILSVRC 2012 dataset, we randomly chose k = 3 classes for model specialization, selected n = 10 images per class from the training set and used them to compare the different pruning criteria.For each criterion, we used the same selection of classes and samples.In these experiments, we do not fine-tune the models after each pruning iteration, in contrast to Section 4.2.1.Performance is averaged over 20 random selections of classes and samples to account for randomness.Please note that before pruning, we firstly reconstructed output layers which has the size of N × k by eliminating the redundant network outputs of 1000 -k.
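The reconstruction of the output layer for the k selected classes amounts to keeping only the corresponding output units of the final fully connected layer. The following is a minimal plain-Python sketch under the assumption that the layer stores one row of N input weights per output class; the function name `restrict_output_layer` and the toy sizes are hypothetical (the real layer would have 1000 outputs).

```python
def restrict_output_layer(weights, biases, keep_classes):
    """Keep only the output units of the final layer that belong to keep_classes.

    weights: one row of N input weights per output class
    biases:  one bias per output class
    """
    new_w = [list(weights[c]) for c in keep_classes]
    new_b = [biases[c] for c in keep_classes]
    return new_w, new_b

# Toy final layer: 5 classes and N = 3 input features.
W = [[float(c + i) for i in range(3)] for c in range(5)]
b = [0.1 * c for c in range(5)]
W_k, b_k = restrict_output_layer(W, b, keep_classes=[4, 0, 2])
```

The order of `keep_classes` defines the class indices of the specialized model, so the same order has to be used when relabeling the target data.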
Furthermore, as our target datasets are relatively small and only have an extremely reduced set of target classes, the pruned models would still be very heavy w.r.t. memory requirements if the pruning process were limited to the convolutional layers, as in Section 4.2.1.
More specifically, while convolutional layers are the dominant source of computation cost, fully connected layers have been shown to be more redundant [40]. In this respect, we applied pruning procedures to both the fully connected and the convolutional layers.
For pruning, we iterate a sequence of two steps: first pruning filters from the convolutional layers, then pruning filters/neurons from the model's fully connected layers.
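The alternating schedule described above can be illustrated with a short sketch. It is a simplified stand-in rather than our implementation: the names `alternating_pruning`, `score_fn`, `n_rounds` and `n_per_step` are hypothetical, and `score_fn` merely represents recomputing a criterion (e.g. relevance) on the reference samples after every pruning step.

```python
def alternating_pruning(units, score_fn, n_rounds, n_per_step):
    """Alternate between pruning convolutional filters and FC neurons.

    units:    dict mapping a unit id to its layer kind, "conv" or "fc"
    score_fn: callable(alive_ids) -> dict of importance scores per unit id
              (assumption: stands in for re-evaluating the criterion)
    """
    alive = set(units)
    history = []
    for _ in range(n_rounds):
        for kind in ("conv", "fc"):  # convolutional step first, then FC step
            scores = score_fn(alive)
            # Rank the surviving units of this kind, least important first;
            # ties are broken by unit id to keep the result deterministic.
            candidates = sorted(
                (u for u in alive if units[u] == kind),
                key=lambda u: (scores[u], u),
            )
            removed = candidates[:n_per_step]
            alive.difference_update(removed)
            history.append((kind, removed))
    return alive, history


# Toy example with fixed, invented scores.
units = {"c0": "conv", "c1": "conv", "c2": "conv",
         "f0": "fc", "f1": "fc", "f2": "fc"}
fixed_scores = {"c0": 0.5, "c1": 0.1, "c2": 0.9,
                "f0": 0.3, "f1": 0.8, "f2": 0.2}


def toy_score_fn(alive_ids):
    return {u: fixed_scores[u] for u in alive_ids}


alive, history = alternating_pruning(units, toy_score_fn, n_rounds=1, n_per_step=1)
```

In this toy run, the least important filter is removed in the convolutional step and the least important neuron in the subsequent fully connected step.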
Table 3 shows the performance of each criterion for classifying a small number of classes (k = 3) from the ILSVRC 2012 dataset. When pruning the fully connected layers, no significant difference across different pruning ratios can be observed. Without further fine-tuning, pruning weights/filters at the fully connected layers retains performance.
However, there is a difference between LRP and the other criteria with increasing pruning ratio of the convolutional layers (LRP vs. Taylor with $\ell_2$-norm: up to 9.6 %, LRP vs. gradient with $\ell_2$-norm: up to 28.0 %, LRP vs. weight with $\ell_2$-norm: up to 27.1 %).
Moreover, pruning convolutional layers needs to be managed more carefully than pruning fully connected layers. We can observe that LRP is applicable for efficiently pruning any layer type (i.e. fully connected, convolutional, pooling, etc.). Additionally, as mentioned in Section 3.1, our method can be applied to general network architectures because it can automatically measure the importance of weights or filters in a global (network-wise) context without further normalization.
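Selecting elements in a global (network-wise) context amounts to pooling the scores of all layers and removing the lowest-ranked units at once. The following sketch illustrates this under the assumption that the scores are directly comparable across layers, as stated above for relevance; the function name `global_prune` and the toy scores are invented for the example. For criteria without this property, a per-layer normalization would have to be applied before pooling.

```python
def global_prune(layer_scores, fraction):
    """Select the lowest-scoring units across all layers at once.

    layer_scores: dict mapping a layer name to a list of per-unit scores
    fraction:     share of all units to remove, between 0 and 1
    Returns a dict mapping each layer name to the unit indices to prune.
    """
    pooled = [
        (score, layer, idx)
        for layer, scores in layer_scores.items()
        for idx, score in enumerate(scores)
    ]
    pooled.sort()  # ascending: least important units first
    n_prune = int(fraction * len(pooled))
    selected = {layer: [] for layer in layer_scores}
    for _, layer, idx in pooled[:n_prune]:
        selected[layer].append(idx)
    return {layer: sorted(ids) for layer, ids in selected.items()}


# Invented scores for three layers (8 units in total).
toy_scores = {
    "conv1": [0.30, 0.02, 0.25],
    "conv2": [0.01, 0.03, 0.20],
    "fc1": [0.15, 0.04],
}
to_prune = global_prune(toy_scores, fraction=0.5)
```

Note that the number of units removed per layer is not fixed in advance; it follows from the global ranking, so layers with many low-scoring units are pruned more heavily.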
Figure 6 shows the test accuracy and training loss as a function of the pruning ratio, in the context of a domain adaptation task towards the Cats&Dogs dataset. As the pruning ratio increases, we can see that even without fine-tuning, using LRP as the pruning criterion keeps the test accuracy and training loss not only relatively stable, but close to 100% and 0, respectively, given the extreme scarcity of data in this experiment. In contrast, the performance decreases significantly when using the $\ell_2$-norm-based criteria (i.e. weight, gradient and Taylor). Initially, the performance even increases slightly when pruning with LRP. During iterative pruning, unexpected changes in accuracy with LRP (in 2 of 20 iterations) were observed at around 50-55% pruning ratio, but accuracy is quickly regained. By pruning 90% of the convolutional filters in the network using our proposed method, we obtain 1) greatly reduced computational cost, 2) faster forward and backward processing, and 3) a lighter model even in the small-sample case, all while adapting an off-the-shelf pre-trained ImageNet model towards a dog-vs.-cat classification task.

Conclusion
Modern CNNs typically have a high capacity with millions of parameters, as this allows good optimization results to be obtained in the training process. After training, however, high inference costs remain, despite the fact that the number of effective parameters in the deep model is actually significantly lower (see e.g. [63]). To alleviate this, pruning aims at compressing and accelerating the given models without sacrificing much predictive performance. In this paper, we have proposed a novel criterion for iterative pruning of CNNs based on the explanation method LRP, linking for the first time two so far disconnected lines of research. LRP has a clearly defined meaning, namely the contribution of an individual network element, i.e. weight or filter, to the network output. Removing elements according to low LRP scores thus means discarding all aspects in the model that do not contribute relevance to its decision making. Hence, as a criterion, the computed relevance scores can easily and cheaply give efficient compression rates without further postprocessing, such as per-layer normalization. Moreover, LRP is technically scalable to general network structures, and its computational cost is similar to that of a gradient backward pass.
In our experiments, the LRP criterion has shown favorable compression performance on a variety of datasets, both with and without retraining after pruning. Especially when pruning without retraining, our results for small datasets suggest that the LRP criterion outperforms the state of the art; its application is therefore especially recommended in transfer learning settings where only a small target dataset is available.
In addition to pruning, the same method can be used to visually interpret the model and explain individual decisions as intuitive relevance heatmaps. Therefore, in future work, we propose to use these heatmaps to elucidate and explain which image features are most strongly affected by pruning, and thereby additionally to prevent the pruning process from leading to undesired Clever Hans phenomena [18]. Also, we would like to further investigate LRP-based pruning with other modern neural network architectures such as GoogLeNet, ResNet, DenseNet, etc. in order to see how to effectively prune the weights/filters in the residual dense blocks of the desired networks.

Figure 1: Illustration of the LRP-based sequential process for pruning. A. Forward propagation of a given image (i.e. cat) through a pre-trained model. B. Evaluation of relevance for weights/filters using LRP. C. Iterative pruning by eliminating the least relevant elements (depicted by circles) and fine-tuning if necessary. The elements can be individual neurons, filters, or other arbitrary groupings of parameters, depending on the model architecture.

Figure 3: Performance comparison of training loss and test accuracy in different criteria as pruning rate increases on VGG-16 with five datasets.

Figure 4: Performance comparison of the proposed method (i.e. LRP) and other criteria on VGG-16 with five datasets. Please note that the weight criterion cannot prune the model without $\ell_p$ regularization.

Figure 6: Performance comparison of pruning without fine-tuning for VGG-16 with small samples from Cats&Dogs dataset, as a means for domain adaption.

Table 3: Performance of pruning of convolutional and fully connected layers on VGG-16 without fine-tuning for a random subset of the classes from ILSVRC 2012 (k = 3) based on LRP (top left), Taylor (top right), gradient (bottom left), and weight (bottom right).

Table 1: Comparison of the accuracy of each criterion with pruned models on toy datasets (moon, circle, and multi-class dataset).

Figure 2: Performance comparison of the criteria on toy datasets (moon, circle, and multi-class dataset). 1st column: scatter plot and decision boundary of the trained model; 2nd column: data samples for pruning; 3rd to 6th columns: changed decision boundaries of the different criteria.

Table 2: Performance comparison of criteria (Weight, Taylor, Gradient with $\ell_2$-norm, and LRP) on VGG-16 with five datasets.