Providing a clear pruning threshold: A novel CNN pruning method via l0 regularisation

Network pruning is a significant way to improve the practicability of convolutional neural networks (CNNs) by removing redundant structure from the network model. However, most existing network pruning methods apply l1 or l2 regularisation to the parameter matrices, and the manual selection of pruning thresholds is difficult and labor-intensive. A novel CNN pruning method via l0 regularisation is proposed, which adopts l0 regularisation to expand the saliency gap between neurons. A half-quadratic splitting (HQS) based iterative algorithm is put forward to calculate an approximate solution of the l0 regularisation problem, so that the joint optimisation of the regularisation term and the training loss function can be solved by various gradient-based algorithms. Meanwhile, a hyperparameter selection method is designed so that most of the hyperparameters in the algorithm can be determined by examining the pre-trained model. The results of experiments on MNIST, Fashion-MNIST and CIFAR100 show that the proposed method provides a much clearer pruning threshold by widening the saliency gap, and achieves similar or even better compression performance compared with state-of-the-art studies.

In the line of regularisation, the l1 [22][23][24] and l2 [17, 25-28] norms are the mainstream choices. He et al. [20] stated that the effectiveness of a norm-based pruning criterion depends on two requirements: (1) the norm deviation of the filters should be large, in other words, the numerical gap between the norms of important and unimportant filters should be large; (2) the minimum norm of the filters should be small. However, applying l1 or l2 regularisation to a parameter matrix pushes all the parameters close to zero, which can hardly increase the saliency gap and results in more difficulty in adjusting the layer-wise pruning ratios in subsequent trimming steps. It should be noted that only the unimportant neurons, rather than all the neurons, need to be regularised to zero; the others should be free from regularisation to find better values.
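The contrast above can be illustrated with a toy numpy sketch (not the paper's algorithm): uniformly shrinking every weight, as global l1/l2 decay tends to do, narrows the absolute gap between filter norms, while selectively zeroing only the weak filters leaves the strong ones untouched and widens it. All shapes and the median cut-off are illustrative.

```python
import numpy as np

# Toy comparison: uniform shrinkage vs. selective (l0-style) zeroing.
rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 25))                 # 8 filters, 25 weights each
saliency = np.linalg.norm(filters, ord=1, axis=1)  # l1-norm saliency per filter

uniform_shrink = 0.5 * filters                     # effect of global weight decay
selective = filters.copy()
selective[saliency < np.median(saliency)] = 0.0    # zero only the weak filters

# Spread (max - min) of the filter saliencies after each treatment.
gap_uniform = np.ptp(np.linalg.norm(uniform_shrink, ord=1, axis=1))
gap_selective = np.ptp(np.linalg.norm(selective, ord=1, axis=1))
# Selective zeroing keeps strong filters intact, so the saliency gap grows.
assert gap_selective > gap_uniform
```

The widened spread is exactly what makes a single pruning threshold easy to place.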
In general, pruning algorithms are iteration-based, with two iteration strategies: global iteration and layer-by-layer iteration. Global iterative pruning [17,24] can globally consider the sensitivity of each layer to obtain the optimal pruning ratio, but the conventional l1 and l2 norms in global regularisation hardly increase the saliency difference between important and unimportant neurons, which makes it hard to trim many neurons in a single iteration, so the trimming ratio of each layer needs to be carefully adjusted. For deep models that require multiple pruning passes, the manual selection of layer-wise pruning ratios at each iteration is a huge workload. In contrast, the layer-by-layer pruning method [14,15] can reach the best pruning result in some layers but is prone to over-cutting, especially in the first few layers. Once the earlier layers are over-pruned, the following layers need more parameters to build a stronger classifier to compensate for the loss of feature information. Because the pruning sensitivity and saliency distribution differ across layers, to obtain the globally optimal solution, layer-wise regularisation coefficients and thresholds need to be carefully selected, which is often labor-intensive.
In order to overcome the shortcomings of previous methods regarding the saliency gap and the difficulty of parameter tuning, we propose to use the l0 norm to regularise the model parameters. l0 regularisation constrains some parameters to 0 while having no effect on the other parameters, leaving some parameters free from regularisation and thereby widening the saliency gap. Xu et al. [29] used the HQS algorithm to solve the l0 regularisation problem in the field of image filtering. However, classical HQS is hard to apply directly to network pruning, due to the obscurity of its hyperparameters and the differences in the regularisation objective across application areas.
In this paper, an HQS-based iterative algorithm is proposed to obtain an approximate solution of the l0 regularisation problem in the field of network pruning. A parameter selection method is also presented, which transforms the ambiguous hyperparameters in the classical HQS algorithm into parameters that can be determined by directly observing the parameter distribution of the pre-trained model. Meanwhile, a strategy of iterative regularisation and one-shot trimming is adopted to further reduce the workload of parameter tuning: First, l0 regularisation is applied to the pre-trained model to make the saliency distribution of neurons as sparse as possible and increase the number of zero-saliency neurons. Then, all the neurons are globally trimmed according to their saliency, only once. Finally, the pruned model is retrained to compensate for the loss of accuracy.
Our major contributions are as follows: i. An iterative HQS-based algorithm is proposed, for the first time, to find the approximate solution of l0 regularisation in the field of network pruning, which can widen the saliency gap between important and unimportant neurons and provide a clear pruning threshold for each layer. ii. A general hyperparameter selection method is presented. Some obscure parameters in HQS are transformed into a new parameter set that can be obtained by observing the histogram of the pre-trained model. We study the influence of each parameter in the new parameter set on the regularisation effect and give a general range for each parameter. iii. The strategy of global regularisation and one-shot trimming requires neither sensitivity analysis nor multiple manual selections of layer-wise pruning ratios. This allows the proposed method to be directly applied to various network models and datasets without many hyperparameter attempts.
The rest of the paper is organised as follows. Related works are described in Section 2. In Section 3, the proposed method is described, including the l0 regularisation, the calculation of hyperparameters and the pruning strategy. Experiments on the MNIST, Fashion-MNIST and CIFAR100 datasets are introduced in Section 4. The conclusion is drawn in Section 5.

l1 or l2 regularisation based methods
Wen et al. [23] proposed a structured sparsity learning (SSL) method to regularise the substructures in a network architecture. In their work, the parameters belonging to the same structure were regularised by group lasso regularisation. Liu et al. [24] introduced a scaling factor for each output channel, which was jointly trained with the network weights and constrained by l1 regularisation. In the trimming step, all the channels with factors smaller than a global threshold were removed. As we discussed earlier, these types of l1 or l2 regularisation based methods can hardly offer a clear pruning threshold, which makes the selection of layer-wise pruning ratios labor-intensive and often requires iterative pruning. Luo et al. [15] proposed a filter-level pruning method called ThiNet, where the filters are pruned layer-by-layer according to their saliency, without sparse regularisation. He et al. [20] proposed a saliency metric called geometric median (GM) and achieved state-of-the-art results, where the saliency metric is the distance between a filter and the geometric centre of the filter set. For this type of method, the layer-wise pruning ratios are often difficult to determine due to the Gaussian distribution of the saliency metric. To achieve an ideal compression ratio on large models, many attempts are usually inevitable.
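The channel-selection step of Liu et al.'s scaling-factor approach described above can be sketched as follows: each BN-style scaling factor serves as a channel saliency, and channels whose factor falls below one global threshold are cut. The layer names and values here are invented for illustration, not taken from the paper.

```python
import numpy as np

# Sketch of global-threshold channel selection on scaling factors
# (values and layer names are illustrative).
bn_scales = {
    "conv1": np.array([0.9, 0.01, 0.5, 0.02]),
    "conv2": np.array([0.03, 0.7, 0.8]),
}
all_scales = np.concatenate(list(bn_scales.values()))
threshold = np.quantile(np.abs(all_scales), 0.5)   # prune ~50% of channels globally

# Boolean keep-mask per layer: True = channel survives trimming.
keep_masks = {name: np.abs(s) > threshold for name, s in bn_scales.items()}
```

Note how the surviving counts per layer fall out of the global threshold rather than being chosen layer by layer, which is exactly why a flat saliency distribution makes the threshold hard to place.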

FIGURE 1
Simplified flowchart of our regularisation algorithm. s is the number of filters in layer 1 and z is the number of weights in every filter. 'Saliency Hist' is the histogram of the saliency vector elements. First, calculate the saliency of each filter. Then calculate the auxiliary variable: the filter parameters in the weight matrix whose saliency is smaller than the regularisation radius are set to 0 (such as w12), while the others are kept unchanged. Finally, the distance between the weight matrix and the auxiliary matrix is jointly optimised with the training loss. During the iteration process, the regularisation radius is automatically decreased

l0 regularisation based methods
Louizos et al. [30] proposed a method based on a parameter prior distribution to use l0 regularisation during network training, which can be regarded as a global pruning method with layer-wise parameters. This method can achieve network compression during training, but its compression rate is not as good as methods based on a pre-trained model. It is already difficult to select hyperparameters for a pre-trained model; doing so from scratch can only be more difficult. Lin et al. [31] proposed structured sparsity regularisation (SSR), which integrates l2,1, l1 and l2,0 regularisation and achieved state-of-the-art performance on many models. In their work, the l0 sparse regularisation is solved for the first time by the alternating direction method of multipliers (ADMM), but most of the hyperparameters in their method are model-related. Moreover, SSR is a layer-by-layer threshold-based pruning method: once a filter is removed, it cannot be restored. Thus, the layer-wise thresholds should be selected carefully, and a sensitivity analysis before pruning is necessary. Meanwhile, the threshold in each layer is controlled not by an individual hyperparameter but by two coupled ones, which leads to the coupling of the pruning threshold and the coefficient of the regularisation term. Thus, to obtain a good result, it is necessary to select a set of suitable layer-wise coefficients whose element number is equal to the number of layers. It should be pointed out that both the selection of hyperparameters and the layer-by-layer pruning strategy are time-consuming.

l0 sparse regularisation
The simplified flowchart of our regularisation algorithm is depicted in Figure 1. Conventionally, applying regularisation to a pre-trained model can be described as minimising the following objective function:

min_W L(W) + λ ∑_{i=1}^{g} p(W_i)    (1)

where W denotes all parameters of the model, L(W) is the training loss function, λ is the regularisation coefficient and p(⋅) is the regularisation function. i and g are the layer index and the number of layers, respectively. W_i is the parameter matrix of layer i, which is a 2-D parameter matrix. For a convolution layer, the shape of a filter is [c_out, c_in, h, w], so W_i is reshaped to [c_out, c_in × h × w]. The objective function of the joint optimisation of saliency-based l0 regularisation and the training loss is:

min_W L(W) + λ ∑_{i=1}^{g} ‖Φ(W_i)‖_0    (2)

where j is the filter index, Φ(W_i) is the saliency vector of layer i, s_i is its element number, related to the regularisation granularity, and φ(⋅) is the saliency function.
To simplify the mathematical derivations, let the saliency of a filter be related only to its own parameters (e.g. its l1 norm), so that we have the bottom equation in (3):

Φ(W_i) = [φ(w_i1), φ(w_i2), …, φ(w_{i s_i})],  φ_ij = φ(w_ij)    (3)

It is worth mentioning that the saliency of filter ij might not be related only to its parameters w_ij, and its definition is still an open question.

FIGURE 2 The saliency of the 2nd filter is smaller than the regularisation radius. In auxiliary variable version 1, all the filters are included, but only those smaller than the radius are included in version 2

Following the HQS principle [29] to solve this non-differentiable problem, the objective function (2) is rewritten as follows:

min_{W,T} L(W) + β ∑_{i=1}^{g} ‖Φ(W_i) − Φ(T_i)‖²_2 + λ ∑_{i=1}^{g} ‖Φ(T_i)‖_0    (5)

where W and T are the two variables that need to be solved. β and λ are both regularisation coefficients with positive values: the former is related to the balance between the training loss and the similarity of (W_i, T_i), and the latter denotes the regularisation intensity of the l0 norm. T_i is an auxiliary variable that is directly constrained by the l0 norm and has the same shape as W_i. So far, (5) can be decomposed into optimising two variables: W and T. First, we get Φ(T_i) by using the l0 norm to constrain Φ(W_i), which is non-differentiable. Then, minimising ‖Φ(W_i) − Φ(T_i)‖²_2 makes every element in Φ(W_i) as similar as possible to its counterpart in Φ(T_i), which can be transferred to minimising ‖W_i − T_i‖²_F or ‖W_i − T_i‖_1. Both regularisation terms are convex and differentiable (only at zero is the latter not), so they can be optimised by many backpropagation-based methods, such as gradient descent. However, the auxiliary variable T limits the change of all parameters, which might be harmful to compensating the performance loss caused by regularisation through the optimisation of other filters. We define a variant of T, called version 2, to verify this intuition at the end of Section 3.1.1, which is shown in Figure 2.

Subproblem 1: Optimising T
In objective function (5), T is only related to the second term.
In (3), we have assumed that a filter's saliency is related only to its own parameters, so the T_ij with different ij are independent of each other and can be solved individually. Combining (5) and (3), the objective function of subproblem 1 is:

min_{T_ij} L_2 = β(W_ij − T_ij)² + λ‖T_ij‖_0    (8)

where L_2 denotes the terms of (5) that involve T, and W_ij and T_ij are the j-th elements of Φ(W_i) and Φ(T_i), respectively.
The closed-form solution of (8) is:

T_ij = W_ij, if (W_ij)² > λ/β;  T_ij = 0, otherwise    (9)

The mathematical derivation is given in Appendix A.1. In Appendix A.2, we also analyse the difference between the approximate and exact solutions, and how the proposed method reduces it. Thus, the corresponding t_ij for each T_ij can be obtained by comparing the saliency φ(w_ij) with the regularisation radius. Finally, combining (12) and (10):

t_ij = w_ij, if φ(w_ij) > √(λ/β);  t_ij = 0, otherwise    (13)
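The closed-form T-update amounts to a hard threshold: filters whose saliency falls inside the regularisation radius are zeroed in the auxiliary variable, the rest are copied unchanged. A minimal numpy sketch (shapes and the l1 saliency choice are illustrative):

```python
import numpy as np

def update_T(W, lam, beta):
    """Hard-threshold update of the auxiliary variable T.

    Rows (filters) of W whose l1-norm saliency is within the
    regularisation radius r = sqrt(lam/beta) are set to zero in T;
    the rest are copied from W unchanged.
    """
    r = np.sqrt(lam / beta)                 # regularisation radius
    saliency = np.abs(W).sum(axis=1)        # l1-norm saliency per filter (row)
    T = W.copy()
    T[saliency <= r] = 0.0                  # zero the low-saliency filters
    return T

W = np.array([[0.5, -0.5],                  # saliency 1.0  -> kept
              [0.01, 0.02],                 # saliency 0.03 -> zeroed
              [1.0, 1.0]])                  # saliency 2.0  -> kept
T = update_T(W, lam=0.09, beta=1.0)         # radius r = 0.3
```

Because kept filters are copied rather than shrunk, the subsequent W-update pulls only the weak filters toward zero, which is what widens the saliency gap.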

Weight-wise regularisation
For weight-wise pruning, φ_ij is the saliency of the j-th weight in layer i, and w_ij is the value of the j-th weight in W_i. A common choice of φ_ij is the magnitude of the weight value, so φ_ij = |w_ij|.

Filter-wise regularisation
For filter-wise and channel-wise pruning, φ_ij is the saliency of the j-th filter in layer i. w_ij denotes the parameter set that includes all the parameters in the corresponding filter, which is the j-th row vector of the matrix W_i. If the saliency metric is the l2 norm, φ_ij = ‖w_ij‖_2.
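Concretely, with the convolution tensor reshaped to the 2-D matrix [c_out, c_in·h·w] described earlier, the filter-wise saliency vector is just the row-wise l2 norm. A small sketch with illustrative shapes:

```python
import numpy as np

# Filter-wise l2 saliency: phi_ij = ||w_ij||_2, where w_ij is row j
# of the reshaped layer matrix W_i. Shapes are illustrative only.
c_out, c_in, h, w = 4, 3, 3, 3
W_conv = np.random.default_rng(1).normal(size=(c_out, c_in, h, w))

W_2d = W_conv.reshape(c_out, -1)            # [c_out, c_in*h*w] parameter matrix
saliency = np.linalg.norm(W_2d, axis=1)     # one l2-norm value per filter
```

The same reshape-then-row-norm pattern gives channel-wise saliency if applied along the input-channel axis instead.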

Batch normalisation regularisation
Batch normalisation is widely used in many network architectures, where every output channel x is normalised and then mapped to y by the function y = ax + b. The scale parameter a can be used to scale the channel outputs. Inspired by previous work [24], when convolution layer i is followed by a batch norm layer, we let the filter saliency φ_ij in (13) equal the magnitude of the corresponding scale parameter, |a_ij|.

Sparse normalisation
In general, the saliency distribution ranges of the layers are vastly different, which means that applying a global r in (10) might cause a different constraint intensity in each layer. For example, if a layer's distribution range is smaller than r in all iterations, the corresponding T is always zero according to (13).
To precisely control the regularisation intensity in each layer as desired, the saliency histograms of the different layers need to be normalised to a similar distribution range. To this end, (13) is replaced by (14),
where the variables with superscript (n) are iteration-varying and n is the current iteration number. W_i^(0) is the parameter matrix of the pre-trained model, and D is the variance function.
In actual situations, the saliency vector of every layer is normalised using (14) once and then multiplied by a layer-wise coefficient to scale it to the same range. A more direct approach is to let most of the saliency values in the pre-trained model be smaller than 1. This method discards the average value of the saliency vector, which might be related to the layer's saliency, but the layer's saliency can still be obtained after the first sparse regularisation: after performing the same regularisation on each layer, the sparser a layer becomes, the less important it is.
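One plausible reading of this normalisation (the exact formula is layer-specific and the scaling constant below is an assumption) is to divide each layer's saliency vector by a layer-wise constant taken from the pre-trained model, so that a single global radius r acts with similar intensity on every layer:

```python
import numpy as np

def normalise_saliency(saliency, reference):
    """Scale a layer's saliency by a layer-wise constant from the
    pre-trained reference so most values fall at or below 1.
    (The max-based constant is an illustrative assumption.)"""
    scale = np.max(np.abs(reference))
    return saliency / scale

# Two layers with very different saliency ranges:
layer_a = np.array([0.2, 5.0, 3.0])
layer_b = np.array([0.001, 0.04, 0.02])
na = normalise_saliency(layer_a, layer_a)   # both now peak at 1.0
nb = normalise_saliency(layer_b, layer_b)
```

After scaling, one global radius (say r = 0.1) zeroes a comparable fraction of each layer instead of wiping out whichever layer happened to have small raw values.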

Subproblem 2: Optimising W
Solving W is equivalent to solving the following equation, which is formed by removing the variables irrelevant to W from (5):

min_W L(W) + β ∑_{i=1}^{g} c_i ‖W_i − T_i‖²_F    (16)
where we add a switch c_i to control the regularisation strength of each layer; generally, c_i = 1. There are various algorithms to solve this kind of minimisation problem, and since the Frobenius-norm loss is convex, it has no great influence on the convergence of the original model. In our work, we use gradient descent to optimise this function. According to (13) and Figure 2, when the parameters within the regularisation radius are pushed to zero, the change of the elements outside the radius is limited at the same time. On the one hand, this protects the performance and the saliency gap by preventing large values from changing. On the other hand, it is not in line with the intuition that pruning should change all neurons. This intuition can be tested by rewriting (16) as follows (called version 2):

min_W L(W) + β ∑_{i=1}^{g} c_i ‖W′_i‖²_F    (17)

where W′_i is the subset of W_i consisting of the elements whose counterpart values in T_i are zero. In other words, only the elements within the range of the regularisation radius are regularised to zero.
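A minimal sketch of subproblem 2 for one layer, using a toy quadratic stand-in for the training loss (the real method uses the network loss and any gradient-based optimiser; learning rate and values below are illustrative):

```python
import numpy as np

def loss_and_grad(W, T, beta, c=1.0):
    """Toy joint objective L(W) + beta*c*||W - T||_F^2 with
    L(W) = 0.5*sum(W^2) standing in for the training loss."""
    train = 0.5 * np.sum(W ** 2)
    reg = beta * c * np.sum((W - T) ** 2)     # Frobenius coupling to T
    grad = W + 2 * beta * c * (W - T)         # d/dW of the sum above
    return train + reg, grad

W = np.array([[1.0, -0.2],
              [0.05, 0.03]])
T = np.where(np.abs(W) > 0.1, W, 0.0)         # auxiliary variable (weight-wise)
lr = 0.04
for _ in range(100):                          # plain gradient descent
    _, g = loss_and_grad(W, T, beta=10.0)
    W = W - lr * g
```

With a large beta, the entries whose T-counterpart is zero are driven to (near) zero, while the entries copied into T barely move, mirroring the "limited change outside the radius" behaviour of version 1.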
Once output channel j is removed in the pruning step, all filters related to channel j in the next layer can also be removed. To avoid a weak channel output being amplified by the filters in the next layer, we regularise the filters with low l2 norm together with the corresponding input filters in the next layer.
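The cross-layer bookkeeping described above can be sketched in a few lines: removing output channel j of one layer also removes the slice of the next layer's filters that reads input channel j. Shapes and the pruned index are illustrative.

```python
import numpy as np

# Layer i produces 6 channels; layer i+1 consumes those 6 channels.
rng = np.random.default_rng(2)
W1 = rng.normal(size=(6, 3, 3, 3))      # layer i:   [c_out=6, c_in=3, h, w]
W2 = rng.normal(size=(4, 6, 3, 3))      # layer i+1: [c_out=4, c_in=6, h, w]

keep = np.ones(6, dtype=bool)
keep[2] = False                          # pretend output channel 2 is pruned

W1_pruned = W1[keep]                     # drop the filter producing channel 2
W2_pruned = W2[:, keep]                  # drop the matching input slice next layer
```

Regularising both pieces together (as the text proposes) ensures a weakened channel is not silently re-amplified by large input weights downstream.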

Parameters updating and terminating condition
The coefficient β highly impacts the similarity between W and T. Once we obtain W from (16), β is updated by (18), and then T and W are solved again in order, and so on.
The proposed sparse regularisation algorithm is summarised in Algorithm 1, where the superscript n ∈ [0, N] represents the iteration number; a variable with superscript 0 takes its initial value, and one with N takes its ending value. The algorithm that uses (16) in step 5 is called version 1, and the one that uses (17) is called version 2.
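The overall iteration can be sketched end-to-end on a toy one-layer "model": alternate the closed-form T-update with gradient steps on W, then multiply β by κ so the radius r = √(λ/β) shrinks each round. The quadratic stand-in loss, the deterministic weights and all constants below are toy assumptions, not the paper's setting.

```python
import numpy as np

# Toy "pre-trained" layer: 8 weak filters (row norm 0.3) and 8 strong (norm 6).
W_star = np.vstack([0.1 * np.ones((8, 9)), 2.0 * np.ones((8, 9))])
W = W_star.copy()
lam, beta, kappa = 0.5, 0.5, 1.5

for n in range(12):                               # Algorithm-1-style outer loop
    r = np.sqrt(lam / beta)                       # regularisation radius
    sal = np.linalg.norm(W, axis=1)               # filter-wise l2 saliency
    T = np.where((sal > r)[:, None], W, 0.0)      # subproblem 1: hard threshold
    for _ in range(50):                           # subproblem 2: gradient steps on
        grad = (W - W_star) + 2 * beta * (W - T)  # toy loss 0.5||W-W*||^2 + beta||W-T||^2
        W -= 0.05 * grad / (1 + 2 * beta)         # normalised step for stability
    beta *= kappa                                 # grow beta -> shrink r next round

sal_final = np.linalg.norm(W, axis=1)             # weak filters -> ~0, strong intact
```

The weak filters' saliency collapses toward zero while the strong filters are untouched, which is the widened gap that makes the final one-shot trim trivial to threshold.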

The calculation of hyperparameters
There are four hyperparameters in our method: β^(0), β^(N), λ and κ. To simplify, let:

r^(n) = √(λ / β^(n))    (19)

where r^(n) is the regularisation radius in iteration n. r^(n) decreases as the iteration number n increases, because β is multiplied by κ > 1 at each iteration according to (18). According to Algorithm 1, the auxiliary variable T^(n+1) is calculated by (14); from the mathematical point of view, the saliency elements smaller than r^(n) are set to zero. Meanwhile, because β^(n) increases with n, the constraint of T^(N) on W^(N) is gradually enhanced according to (16), which pushes more small parameters in W^(N) closer to 0.
Let r^(0) and r^(N) denote the initial and stopping radius, respectively. These two processes coordinate with each other and should follow these principles: (i) In the beginning, a large r^(0) can provide a wide parameter search range to find more potential pruning candidates, and a small β^(0) can balance the larger regularisation loss to protect model performance. (ii) At the end of the iteration, a small r^(N) can provide a more precise norm constraint, and a higher β^(N) helps to increase the number of zero elements in W^(N). (iii) In order to protect model performance, the ratio of the regularisation loss to the training error (the regularisation and training-loss terms in (2), respectively) in the final iteration should be strictly limited.
According to principle (iii), let k be the ratio of the regularisation loss to the training loss in the final iteration; according to (16):

k = β^(N) ∑_i ‖W_i^(N) − T_i^(N)‖²_F / L(W^(N))    (21)
Assuming that the model performance and training error are almost equal before and after pruning, the W^(N) on the right-hand side of (21) can be replaced by W^(0). According to (16) and (21), to protect the model performance, β^(N) should be small, so overestimating ‖W^(N) − T^(N)‖²_F is necessary. The similarity of W and T increases with the iterations, so T^(N) in (21) is replaced by T^(0), and (21) can be rewritten as follows:

β^(N) = k · L(W^(0)) / ∑_i ‖W_i^(0) − T_i^(0)‖²_F    (23)

T^(0) can be solved by combining (13) and (19). According to principle (i), a large r^(0) is beneficial to expand the search range, and a typical value covers the whole range of parameter values. Generally, r^(0) can easily be obtained by observing the histogram of the parameters; furthermore, if the model parameters are normalised, it can even be independent of the model. Once r^(0) is defined, β^(N) and λ can be obtained by (23) and (24), respectively.
According to principle (ii), r^(N) should be a small value, which can also be obtained from the parameter histogram. Once r^(N) is defined, λ can be calculated by (24). Assuming the maximum number of iterations is N, from (18) and (19) we have:

N = ⌈log_κ (r^(0)/r^(N))²⌉    (25)

where the superscript N of κ represents the N-th power of κ, while the superscript (N) of β^(N) denotes the maximum number of iterations.
In HQS theory, a small κ is good for getting close to the real solution of the l0 regularisation; the typical range is (1.1, 2]. The hyperparameter selection method is summarised in Algorithm 2.
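The hyperparameter mapping above can be sketched as a small helper. It assumes the radius–coefficient relation r^(n) = √(λ/β^(n)), the geometric β update, and the loss-ratio definition of k discussed in this section; the training-loss and distance values fed in below are toy numbers, not measurements.

```python
import math

def derive_hyperparams(k, r0, rN, kappa, train_loss, reg_distance_sq):
    """Map the observable set (k, r0, rN, kappa) to (beta0, betaN, lam, N),
    assuming r = sqrt(lam/beta) and beta^(n+1) = kappa * beta^(n)."""
    beta_N = k * train_loss / reg_distance_sq        # loss-ratio principle
    lam = beta_N * rN ** 2                           # from r(N) = sqrt(lam/beta_N)
    beta_0 = lam / r0 ** 2                           # from r(0) = sqrt(lam/beta_0)
    N = math.ceil(math.log((r0 / rN) ** 2, kappa))   # iterations to shrink r0 -> rN
    return beta_0, beta_N, lam, N

# Toy call with illustrative loss/distance values:
b0, bN, lam, N = derive_hyperparams(k=2, r0=0.9, rN=0.1, kappa=1.5,
                                    train_loss=0.05, reg_distance_sq=1.0)
```

Everything the user must supply (k, r0, rN, kappa) is readable from the pre-trained model's histogram or a loss printout, which is the point of the reparameterisation.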

Pruning
Various saliency metrics and pruning methods can be introduced into our algorithmic framework. In our experiments, as shown in Figure 6, compared with the baseline model and the conventional regularisation methods, the proposed method possesses a larger saliency difference, which means that even a simple pruning strategy can achieve good results and saves us from the tedious loop of threshold selection and testing.

Experiments setup
The experiments include two parts: hyperparameter selection and pruning performance comparison. For the former, the fluctuation of the pruning ratio under wide-range changes of the hyperparameters when pruning LeNet models on the MNIST dataset is investigated, and then the parameter selection principle is summarised. For the latter, the proposed method is compared with some state-of-the-art algorithms on the CIFAR100 dataset, such as Liu's [24], ThiNet [15], SSL [23], SSR [31] and GM [20].

LeNet on MNIST
MNIST is a handwritten digit image dataset with 60k training and 10k test samples, which is suitable for principle tests. In this section, we implement our method on LeNet and study the influence of each parameter on the saliency curve. The baseline model LeNet-20-50-500 ('20-50-500' is the number of neurons in each layer) is the same as SSR's (github.com/ShaohuiLin/SSR), and we use Adam to optimise the loss function with default parameters (learning rate 0.001, β1 = 0.9, β2 = 0.999, epsilon 10^-8). In this part, the standard choices of k, r^(0), r^(N) and κ are 2, 0.1, 1 and 1.5, respectively. Only one parameter is changed at a time, and the others use the standard values. Figure 3a and Figure 3b show the variation of the saliency curves with increasing k and κ, respectively. The best range of k here is [0.6, 2], but even when k is overestimated 10 times (to the value 10), the pruning ratio is reduced by less than 10%. In practice, k is the last parameter to be set: just make the regularisation loss the same order of magnitude as the network loss in the first iteration. The typical range of k is [1, 10]. With decreasing κ, the number of zero elements increases and the saliency gap also expands: the smaller the parameter value, the more slowly the regularisation radius shrinks, and the more filters fall below the radius in each iteration. As shown in (25), the number of iterations is highly related to κ. The best range of κ is [1.1, 1.7], and 1.5 is a standard choice. Figure 3c and Figure 3d show the variation of the saliency curves with increasing r^(0) and r^(N), respectively. With increasing r^(0) or decreasing r^(N), the pruning size is increased and the gap is enlarged, but the number of iterations is also increased. The pruning size is more sensitive to r^(0) than to r^(N), so the increase in pruning size brought by increasing r^(0) is larger than that brought by decreasing r^(N).
As shown in (25), the computational cost of increasing r^(0) is lower than that of decreasing r^(N).
In this section, we compare our approach with SSR [31]. The three error rates reported here are pruned from the same sparse model, whose hyperparameters k, r^(0), r^(N) and κ are 2, 0.9, 0.1 and 1.5, respectively. We use the official code with the default setting (the regularisation factor set is [0.4, 0.15, 0.02]) to implement SSR, denoted 'SSR default'. We also list the result reported in SSR's paper, denoted 'SSR paper'. As we can see from Table 1, our method obtains performance similar to the others. Because this experiment mainly focuses on parameter selection, we do not make an in-depth comparison of the different methods.

LeNet on Fashion-MNIST
In this section, we use the same hyperparameters as before to carry out sparse regularisation and study the difference between version 1 and version 2. In our experiments, the error rate of the baseline model is 8.57%. To study the difference between the auxiliary variables, we increase k from 0.2 to 2 and plot the respective saliency curves in Figure 4. As shown in Figure 4, with the same setting, the number of filters with a low l2 norm in version 1 is slightly larger than in version 2, which increases the pruning size.
With the same loss of pruned accuracy (before finetuning), version 1 can trim more filters than version 2, and the l2 norm of the important filters in the results of version 1 is slightly higher than that in version 2. The reason is that, in the first few iterations of version 2, most of the filters are smaller than the regularisation radius, so the l2 norm of the filters outside the radius also decreases, which makes the saliency curve flatter. In version 1, the filters maintain their original l2 norm, so the curve is steeper. Therefore, version 1 can remove more filters than version 2 when the radius is reduced. We also compare the proposed method with SSR in Table 2. For SSR, a high factor means a high cutting threshold; for example, when the factor of layer 1 goes from 0.3 up to 0.4, the number of filters drops from 18 to 10. The selection of the regularisation factor set is not an easy job: for this pre-trained model, if the factor of layer 3 is greater than 0.3, all the filters in layer 3 are removed, yet for the MNIST dataset the paper [31] reported that the factor set can be set to [0.5, 0.5, 0.5]. Comparing the last row of SSR's results with the last row of ours, the proposed method obtains a lower accuracy loss at a high compression ratio.

VGG19 on CIFAR-100
In this section, we apply our method to the variant of VGG19 on the CIFAR-100 dataset and compare it with Liu's work [24], SSR [31], SSL [23], ThiNet [15] and GM [20]. The variant was made by Liu et al. [32] and can be obtained here (github.com/Eric-mingjie/rethinking-network-pruning). The main difference from the classic VGG19 is that it adds a batch normalisation layer after each convolutional layer. Therefore, we can sparsify either the scale coefficients of the batch normalisation layers following the convolution layers or the filters' parameters themselves.

Scale coefficients
Liu et al. [24] applied l1 regularisation to the scale factors in the batch normalisation layers and used a global threshold in the trimming step for simplicity. For a fair comparison (although it is a little unfair to us), in this part we first use our method to compress the model (from which we obtain the layer-wise ratios) and then implement Liu's method with the same layer-wise pruning ratios. In this part, the hyperparameters k, r^(0), r^(N) and κ are 4, 1, 0.1 and 1.4, respectively, similar to the setting on the MNIST dataset. All the models are finetuned with the same setting (learning rate 0.01, multiplied by 0.1 every 40 epochs), and the best accuracy is listed in Table 3. As shown in Table 3, with the same pruning ratio, our method has a significant advantage in pruned accuracy owing to the wide saliency gap. Specifically, when the 'Size' is less than 20.93, Liu's models completely lose their classification ability. Although the accuracy drops as low as 0.97%, it can still recover to an acceptable level after retraining, but a gap with our method remains. High pruned accuracy means high confidence in the optimality of the layer-wise pruning ratios; an inappropriate pruning ratio set will leave the model with a large loss of accuracy even after retraining, which will be discussed in the next section. We can also find that our finetuned accuracy is better when the model size is small (≤ 26.47). In 'Ours-L1', some filters in the first few layers are removed, which makes the FLOPs significantly lower than those of the other methods.
The global threshold-based pruning method is also implemented, but its accuracy is not listed in Table 3, because the filters in layers 10 to 14 are all removed when the pruned size is less than 29.11, as shown in Figure 5b. The number of filters retained by this method at different compression levels in each layer is compared with ours in Figure 5. In Figure 5b, we can find that most of the filters in layers 9-14 are removed, but layers 8 and 15 change little. In Figure 5a, however, as the compression level increases, these layers still retain a certain number of filters. To illustrate this problem, we depict the scale coefficients of layers 10 and 15 in Figure 6a and Figure 6b. As we know, the filters with low scale coefficients in the batch normalisation layer can be removed, and we find that the scale coefficients in layer 10 are much lower than those in layer 15. Following the global threshold-based pruning method, to obtain a better compression rate, most of the filters in layers 10 to 14 are removed, so the inter-layer feature transmission is highly blocked, which leads to a serious performance drop. Compared with keeping a layer with a small number of filters, removing the whole layer might be a better choice, but it is hard to put into practice: as shown in Figure 5b, following the global threshold principle, all the filters from layers 10 to 14 could be removed, and it is hard to decide how many layers should be retained, because removing all of them will cause serious performance problems and the depth of the network is important. On the other hand, if we use a layer-wise pruning ratio, as shown in Figure 6c and Figure 6d, it is difficult to decide the pruning ratio when using l1 and l2 regularisation.
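The failure mode just described can be reproduced in miniature: under a single global threshold, a layer whose scale coefficients are uniformly small loses every channel, severing the feature path. The values below are invented for illustration.

```python
import numpy as np

# Invented BN scale coefficients mimicking the situation in the text:
# "layer10" is uniformly small, "layer15" has a few strong channels.
scales = {
    "layer10": np.array([0.02, 0.03, 0.01, 0.04]),
    "layer15": np.array([0.6, 0.9, 0.05, 0.7]),
}
threshold = 0.05                                    # one global cut-off
survivors = {name: int((np.abs(v) > threshold).sum())
             for name, v in scales.items()}
# layer10 keeps nothing -> the whole layer is (accidentally) removed.
```

A per-layer normalisation of the saliency ranges, as in Section 3, avoids exactly this collapse without hand-tuning layer-wise ratios.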
We use different regularisation terms to regularise the model and then record the respective saliency curves in Figure 6. As shown in Figure 6, l1 and l2 regularisation fail to widen the saliency gap in this instance; the l2 regularisation can hardly change the distribution of the parameters at all, so it is very difficult to find an ideal pruning threshold without many attempts. Compared with them, the proposed method can drive more parameters to zero and widen the saliency gap.
In Figure 6, it is worth noting that l1 regularisation looks better than l2. However, in (16), the distance between W and T is reduced by minimising ‖W − T‖²_F, so we also implement a version with ‖W − T‖_1 ('Ours-L1' in Table 3). To study the difference in saliency caused by the different distance definitions, we use each distance definition for sparse regularisation while keeping the other parameters unchanged, and then draw the saliency curves in Figure 7. As shown in Figure 7, in terms of pruning speed, the l1 version is better than the l2 one. To explain, from the perspective of regularisation, when a parameter is small, it converges to 0 faster under the l1 norm. Table 3 shows that when the number of pruned parameters is 19.06% of the pre-trained model, the pruning step does not cause any loss of accuracy, but the finetuned accuracy is worse than the standard Frobenius version. This result is also predictable: l1 regularisation can set a parameter exactly to zero, but l2 regularisation will prevent the parameter from becoming too small. When a filter is set to 0, backpropagation will not change its parameters, which is equivalent to hard removal. Meanwhile, the l1 norm converges faster, which is prone to over-pruning. The layer-wise pruning ratios in 'GM*', 'SSL*' and 'SSR' are automatically calculated, and the others are the same as ours.
In contrast, the l2 regularisation can keep a parameter at a low value for a long time, so that a filter deleted by mistake can be restored. Table 4 shows the results of SSL, ThiNet, GM, SSR and ours. The layer-wise pruning ratios used by 'SSL', 'ThiNet' and 'GM' are the same as those we used in the previous section, except for the results marked with '*'. In SSL, the regularisation is applied to the l1 norm of a filter and its input channels, so we re-regularise the pre-trained model; the accuracy of the sparse model is 72.69%. The characteristics and defects of SSL are similar to Liu's method. Compared with Liu's results in Table 3, at the same compression level the results of SSL are slightly better than Liu's. It is unclear whether this is caused by differences in regularisation or differences in the saliency metric. 'SSL*' is the result of the global threshold-based pruning method. At a similar model size, the FLOPs of SSL* are lower than ours (in our method, when the size is 26.47%, the FLOPs are 67.95%; in SSL*, the size is 27.66% and the FLOPs are 65.05%), but its accuracy, model size and FLOPs are significantly worse than the other results, because many filters in layers 4-8 are removed by mistake.

ThiNet and GM are regularisation-free methods, so we apply them directly to the pre-trained model. The layer-wise pruning ratios in these two methods need to be set before pruning. Both methods prune according to their respective saliency metrics, but ThiNet further updates the values of the retained parameters to minimise the reconstruction error, so it is predictable that the pruned accuracy of ThiNet is higher than that of GM. 'GM*' simply prunes all the weighted layers with the same pruning ratio at the same time in the filter pruning step. However, the filters in layers 4 to 8 of this model are critically important and leave little room for pruning, so the poor result is expected. SSR removes more filters in the first few layers but keeps more in the later ones, so it removes more FLOPs but fewer parameters. We also apply our method to the l2 norm of filters. To obtain a sparser model, we regularise multiple times in the regularisation stage and obtain sparse models with 72.08% and 72.13% accuracy. In the pruning stage, all filters with l2 norm less than 0.01 are removed. As shown in the penultimate line of Table 4, the pruned model can be reduced to 16.8% of the sparse model with almost no accuracy loss. At a similar finetuned accuracy, our model size and FLOPs are significantly less than those of the other methods.
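The norm-threshold pruning step described above (removing every filter whose l2 norm falls below 0.01) can be sketched as follows; the function name and the 4-D weight layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prune_by_filter_norm(conv_weight, threshold=0.01):
    # conv_weight: (out_channels, in_channels, kH, kW) array.
    # Compute each filter's l2 norm and keep only filters at or
    # above the threshold.
    norms = np.linalg.norm(conv_weight.reshape(conv_weight.shape[0], -1), axis=1)
    keep = norms >= threshold
    return conv_weight[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))
w[[1, 5]] *= 1e-4  # two filters regularised to (near) zero
pruned, keep = prune_by_filter_norm(w)
print(keep.sum())  # 6 filters survive
```

Because the proposed regularisation widens the gap between zeroed and retained filters, a single global threshold such as 0.01 suffices, with no layer-wise tuning.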
In Figure 8, we draw the saliency curves of the last layer under different saliency metrics for the pre-trained, SSL and our sparse models, respectively. As shown in Figure 8a, the saliency distribution of the pre-trained model is fairly even regardless of the saliency metric. The conventional l1 and l2 norms can hardly widen the saliency gap, which is not a big problem when the number of zero filters is large; if not, the smooth saliency curve cannot provide a clear pruning threshold, just like Figures 6c and 8b. Figure 8c shows that our method significantly increases the saliency gap under the 'L1', 'L2' and geometric median (GM) metrics. The BN curve has a jump before 500, the same as the other metrics.
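A saliency curve of the kind plotted in Figure 8 is simply the sorted per-filter saliency; when the gap is wide, the largest jump of the curve gives an obvious threshold. The sketch below assumes hypothetical helper names, and approximates the geometric median with the coordinate-wise mean filter for brevity.

```python
import numpy as np

def saliency_curve(filters, metric="l2"):
    # Sorted per-filter saliency; a sharp jump in this curve
    # marks a clear pruning threshold.
    flat = filters.reshape(filters.shape[0], -1)
    if metric == "l1":
        s = np.abs(flat).sum(axis=1)
    elif metric == "l2":
        s = np.linalg.norm(flat, axis=1)
    else:  # "gm": distance to the geometric median, here
           # approximated by the coordinate-wise mean filter
        s = np.linalg.norm(flat - flat.mean(axis=0), axis=1)
    return np.sort(s)

def threshold_from_gap(curve):
    # Place the pruning threshold inside the largest jump.
    gaps = np.diff(curve)
    i = int(np.argmax(gaps))
    return 0.5 * (curve[i] + curve[i + 1])
```

On a smooth curve (Figures 6c and 8b) the largest jump is tiny and the returned threshold is arbitrary; on a curve with a widened gap (Figure 8c) it is unambiguous.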
In Figure 9, we draw the mean of the absolute values of the filter parameters under different regularisation methods before retraining. SSL applies l1 regularisation to all the parameters, so its mean value is obviously lower than that of the pre-trained model. As shown in Figure 9b, when we remove all filters with l2 norm less than 0.2, the model size is 33.73% but the accuracy is only 57.98%, which is worse than using our pruning ratio (size 33.20%, accuracy 72.45%). Our method yields both lower low values and higher high values in layers 8-15, so the value gap is significant and all filters with l2 norm less than 0.01 can be safely removed.

CONCLUSION
A novel CNN pruning method via l0 regularisation is proposed, in which an HQS-based iterative algorithm calculates the approximate solution of the l0 regularisation of the model parameters; thus, it can be jointly optimised with the training loss by gradient-based methods. The experimental results showed that, compared with l1 and l2 regularisation, the proposed method increases the numerical difference between important and unimportant neurons, which means it can provide a very clear pruning threshold and benefits the threshold selection of iterative pruning. On the testing datasets, there was only 0.01% accuracy loss after removing more than 80% of the model parameters, which indicates that, although a strategy of multi-round regularisation and one-shot pruning is used, the model accuracy and compression ratio are still competitive.
Based on the assumption in (3), the T_ij with different i and j are irrelevant to each other, so every T_ij in (A.1) can be solved independently by (A.5), and the closed-form solution of (A.2) is (A.6).
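Equations (A.5) and (A.6) are not reproduced in this extract. Assuming the HQS subproblem for T takes the standard form min_T λ‖T‖₀ + (β/2)‖W − T‖²_F (with λ the regularisation weight and β the penalty parameter, names assumed here), the element-wise closed-form solution is hard thresholding: keep W_ij when (β/2)W_ij² > λ, otherwise set T_ij = 0. A minimal numpy sketch:

```python
import numpy as np

def l0_prox(w, lam, beta):
    # Element-wise solution of the assumed HQS subproblem
    #   min_T  lam * ||T||_0 + (beta/2) * ||W - T||_F^2:
    # zeroing W_ij costs (beta/2) * W_ij**2, keeping it costs lam,
    # so keep W_ij only when W_ij**2 > 2 * lam / beta.
    mask = w ** 2 > 2.0 * lam / beta
    return np.where(mask, w, 0.0)

w = np.array([0.05, -0.3, 1.2, -0.02])
print(l0_prox(w, lam=0.01, beta=1.0))
```

Unlike the soft-thresholding of l1, this operator leaves the surviving parameters completely unchanged, which is exactly what widens the saliency gap between zeroed and retained filters.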

A.2 Difference between approximate solution and exact solution
The objective function is (A.7). Let the exact solution of (A.7) be T̂. To ensure that the parameter W can converge to the exact solution T̂ in subproblem 2, T should be equal to T̂. Under this ideal condition, subproblem 2 becomes (A.8). Dividing the second term into two parts gives (A.9), where the parameters with subscript ≠0 are retained and those with subscript =0 are removed, so T̂_{i,=0} = 0. Because the T we obtain is inaccurate, both T_{≠0} and T_{=0} are possibly inaccurate. Consider the following function, where the real solution of W_{≠0} is T̂_{≠0}: in this equation, if the second regularisation term is removed, the best result can still be reached by minimising the training loss. Letting W_{≠0} be free to find better values without any regularisation, it is possible to reach T̂_{≠0}. This is why we proposed auxiliary variable version 2.
More importantly, the previous analysis is based on the premise that T_{≠0} and T̂_{≠0} are the same set of parameters; in other words, the pruning ratio needs to be as accurate as possible.
In the proposed algorithm, we use a wide regularisation radius at the beginning to explore more potential solutions, and gradually shrink it to a narrow range to improve pruning precision at the end, which provides a clear pruning threshold in each layer and finally improves the accuracy of the pruning ratio.
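The wide-to-narrow shrinking of the regularisation radius could follow many schedules; the paper does not specify one, so the linear schedule below is purely a hypothetical illustration of the idea.

```python
def radius_schedule(r_start, r_end, epochs):
    # Hypothetical linearly shrinking regularisation radius:
    # starts wide (r_start) to explore, ends narrow (r_end)
    # to sharpen the pruning threshold.
    for e in range(epochs):
        yield r_start + (r_end - r_start) * e / (epochs - 1)

for r in radius_schedule(0.5, 0.05, 4):
    print(r)
```

Any monotonically shrinking schedule (e.g. exponential decay) would serve the same purpose; the essential property is that the radius narrows as the pruning ratio stabilises.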