Random pruning: channel sparsity by expectation scaling factor

Pruning is an efficient method for deep neural network model compression and acceleration. However, existing pruning strategies, both at the filter level and at the channel level, often introduce a large amount of computation and adopt complex methods for finding sub-networks. We find that there is a linear relationship between the sum of matrix elements of a channel in convolutional neural networks (CNNs) and the expectation scaling ratio of the image pixel distribution, which reflects how the expectation of the pixel distribution changes between the feature mapping and the input data. This implies that channels with similar expectation scaling factors (δ_E) cause similar expectation changes to the input data and thus produce redundant feature mappings. This article therefore proposes a new structured pruning method called EXP. In the proposed method, channels with similar δ_E are randomly removed in each convolutional layer, so that the whole network achieves random sparsity and yields non-redundant, non-unique sub-networks. Experiments on pruning various networks show that EXP can achieve a significant reduction of FLOPs. For example, on the CIFAR-10 dataset, EXP reduces the FLOPs of the ResNet-56 model by 71.9% with a 0.23% loss in Top-1 accuracy. On ILSVRC-2012, it reduces the FLOPs of the ResNet-50 model by 60.0% with a 1.13% loss in Top-1 accuracy.
Our code is available at: https://github.com/EXP-Pruning/EXP_Pruning and DOI: 10.5281/zenodo.8141065.


INTRODUCTION
CNNs with deeper and broader structures provide higher performance on computer vision tasks. However, deeper models imply larger numbers of FLOPs and parameters. For example, the original VGG-16 model (Simonyan & Zisserman, 2014) has hundreds of millions of parameters, and the 152-layer ResNet requires billions of FLOPs, which makes it difficult to deploy such models on mobile devices. To solve this problem, researchers have proposed various compression techniques for CNNs to reduce model FLOPs.
The number of operations in the convolutional layers occupies about 90% of the overall computation (Yang, Chen & Sze, 2017), so there is a large body of work on compressing convolutional layers. A simple approach is to construct sparse convolutional layers via constraints (Wen et al., 2016; Lebedev & Lempitsky, 2016), and He et al. (2018) and Li et al. (2016) proposed pruning methods based on the norm. However, these methods have a limited compression effect and do not provide significant speedup, and weight pruning is an unstructured method that cannot easily be ported to mobile devices. Many researchers have continued to propose sophisticated solutions for exploring the importance and redundancy of filters: for example, reusing data samples to reflect the average rank (Lin et al., 2020a) and entropy (Wang et al., 2021a) of the feature mappings obtained from filters, to determine whether a filter produces useless information; using the conditional accuracy variation associated with the results to assess the importance of each channel (Chen et al., 2020); calculating the classification contribution of filters to determine their importance and removing low-importance filters (Zhang et al., 2022); and using LASSO regression (He, Zhang & Sun, 2017) to sparsify layer by layer and remove filters that are close to the geometric median. These methods all investigate properties of the filters in order to explore the effect the model internally produces on the results; however, their fixed pruning strategies can lead to large performance losses.
The filter-property-based pruning methods above add little extra time when developing pruning strategies. However, adaptive pruning (Wang, Li & Wang, 2021; Liu et al., 2017; Huang & Wang, 2018), dynamic pruning, and architecture-search-driven methods suffer from problems such as long pruning decision times: for example, using transformable architecture search to find the optimal size of small networks (Dong & Yang, 2019); using the descent of the loss function to formulate weight movement rules (Sanh, Wolf & Rush, 2020); and combining meta-learning with architecture search (Liu et al., 2019). Meanwhile, Liu et al. (2018) pointed out that the benefit of the adaptive pruning approach lies in the search for an effective network structure rather than in the selection of important weights. The stripe pruning proposed by Meng et al. (2020) therefore differs from previous structured pruning by using a filter skeleton to learn the optimal shape of the sub-network. Adaptive methods always add extra training conditions (using data samples or searching for the best sub-network shape), which incurs additional time costs. Existing methods therefore struggle to find a trade-off between high performance and simple strategies.
By analyzing the properties of channels, this article proposes a random channel pruning method using the expectation scaling factor of the channel; the overall pruning process is shown in Fig. 1. In our experiments, we found that for a single image sample, different channels with similar sums of matrix elements alter the pixel distribution of the sample in similar ways, and these channels produce similar expectation ratios of the pixel distribution between the feature mapping and the input data. That is, there is a linear relationship between the sum of matrix elements of a channel (unlike the ℓ1-norm) and the δ_E of the data sample, as shown in Fig. 2. This article also assumes that the sub-networks that can effectively represent the original model's capability are not unique, and that all parameters obtained from training play a role in the model. Therefore, this article does not select important channels but removes channels that produce similar effects, because randomly removing channels with similar δ_E reduces the redundancy of the model and decreases excessive focus on local features. Moreover, the proposed EXP method is based on model parameters: it focuses on the expectation-changing effect of channels on data samples and does not introduce additional constraints, simplifying the complexity and computation of pruning decisions. Many typical network structures, including VGGNet, ResNet, and GoogLeNet, are used for extensive experiments on two standard datasets, CIFAR-10 and ImageNet ILSVRC2012. The results indicate that EXP outperforms popular pruning methods with significant compression and acceleration, and the random pruning strategy of EXP shows that the selection of sub-networks is not unique.
In summary, the main contributions of this article are as follows.
1) Based on extensive statistical validation, it is demonstrated that for any data sample there is always a linear relationship between the sum of matrix elements of a channel and δ_E. The focus on channel properties thereby shifts from the norm to the change in distribution expectation.

RELATED WORK
Most of the work on compressed CNNs can be divided into low-rank decomposition (Tai et al., 2015; Zhang et al., 2015), knowledge distillation (Cheng et al., 2017), quantization (Son, Nah & Lee, 2018), and pruning. Among them, pruning methods are simple and effective, and they are commonly used in model compression. To evaluate the degree of importance of filters, many empirical criteria are used to classify filters into those that contribute more and less to the network, based on model parameters, feature selection, parameter gradients, or architecture search, as shown in Table 1.

Based on model parameter criteria
This class of methods relies on prior knowledge to determine redundancy with the help of the parameters in the model. For example, pruning methods based on norm information have been proposed by many researchers (He et al., 2018; Li et al., 2016; He et al., 2019; He, Zhang & Sun, 2017). Han et al. (2015) proposed an iterative method that removes small weights below a predefined threshold, thus achieving sparsity. Guo, Yao & Chen (2016) proposed dynamic pruning combined with restoration, where a portion of the weights is pruned and, if a weight is found to be important at any time, it is restored. In practice, appropriate weight decay can alleviate overfitting, so using parameter magnitudes to determine redundancy is not reliable. In addition, Wang et al. (2021b) proposed using expectation and variance to directly calculate the similarity of filters. Zhao et al. (2019) proposed extending the scale factor to shift terms to reformulate the batch normalization layer, estimating the channel saliency distribution and sparsifying it by variational inference. Yu et al. (2018) proposed applying feature ranking techniques to measure the importance of each neuron in the final response layer, formulating pruning as a binary integer optimization problem. Lin et al. (2021) used the affinity propagation message-passing algorithm on the weight matrix to obtain an adaptive number of exemplars, which were then used as the retained filters. Yang & Liu (2022) used the sum of the sensitivities of all weights in a filter to quantify its sensitivity and pruned filters with lower 2nd-order sensitivity. This type of method simplifies the pruning decision and reduces computational complexity, and correct prior knowledge is the key to guiding it.

Based on feature selection criteria
This class of methods addresses filter redundancy by calculating the amount of information or similarity in the feature mappings. For example, data samples are reused to reflect the average rank (Lin et al., 2020a) and entropy (Wang et al., 2021a) of the feature mappings obtained from the filters, to determine whether a filter produces useless information. The similarity between filters can be analyzed using color and texture histograms of feature mappings (Yao et al., 2021). Using the complexity and similarity of different samples to uncover the flow pattern information of the samples, a controller processes the input features and predicts the saliency of the channels, thus completing dynamic pruning (Tang et al., 2021). Zhang et al. (2022) remove filters of low importance by calculating their classification contribution. FPC (Chen et al., 2022) applies singular value decomposition to feature mappings to evaluate their contribution and removes the lower-contributing parts. This type of method also simplifies the pruning decision, but the pruned sub-networks may be biased toward particular datasets, which reduces generalization ability.

Based on parametric gradient criteria
This class of methods fine-tunes the model to observe changes in parameters, or uses gradient information to complete pruning. Movement pruning (Sanh, Wolf & Rush, 2020) removes connections whose weights gradually move toward 0, using first-order information of the weights during fine-tuning. SNIP (Lee, Ajanthan & Torr, 2018) proposes a pruning method based on the importance of weight connections, determining the importance of connections through gradient information during fine-tuning. Molchanov et al. (2016) interleave greedy criteria-based pruning with fine-tuning via backpropagation for efficient pruning. This type of method often requires constant fine-tuning of the model, which increases the time consumption of pruning decisions.

Based on architectural search-driven criteria
This class of methods searches for the optimal sub-network structure while training the model. These methods (Dong & Yang, 2019; Liu et al., 2018) often introduce a complex and intensive search process. PruningNet (Liu et al., 2019) predicts the parameters of sub-networks and searches for the best sub-networks with evolutionary algorithms. Artificial bee colony algorithms (Lin et al., 2020b) are applied to search architectures, with accuracy serving as the fitness of each architecture.

Others
GAL (Lin et al., 2019) generates sparsity by forcing the scaling factors in soft masks to zero through generative adversarial learning. SRR-GR (Wang, Li & Wang, 2021) statistically models the network pruning problem and finds that pruning the layer with the most structural redundancy outperforms pruning the least important filters across all layers. NPPM (Gao et al., 2021) uses an independent neural network to predict the performance of sub-networks as a guide for maximizing pruning performance, and introduces episodic memory to update and collect sub-networks during pruning. SCOP (Tang et al., 2020) trains the specified network on a mixture of fake and real data with learnable scaling factors, and prunes the filters whose fake-data scaling factors are larger.

Discussion
Popular pruning methods tend to be increasingly sophisticated while yielding only small accuracy gains. In contrast, pruning methods based on parameter gradients and structural search obtain better compression and speedup, but at the cost of more computation time. These two problems pose fundamental challenges for deploying CNNs on mobile devices, and they can be attributed to the lack of theoretical guidance for determining network redundancy. In this article, we verify the effectiveness of the EXP method both theoretically and experimentally, by analyzing the role of different channels in changing the expectation of the pixel distribution and by exploring the redundancy, rather than the importance, of channels. Compared with popular methods, the novelty of EXP is that the pruning strategy is very simple: it introduces no computationally intensive search process and need not consider the specificity of data samples.

THE PROPOSED METHOD
This article aims to achieve channel-level random sparsity in the network using a linear relation for δ_E. "Linear Relations" first introduces the δ_E linear relation, followed by a discussion in "Selecting Redundant Channels" of how to achieve random sparsity in the network with this relation. First, notation is introduced to discuss the decision process of the pruning strategy. Suppose C_i is the i-th convolutional layer in a trained CNN, and let W_Ci denote the set of all channels in this convolutional layer. In the pruning process, the channels in each convolutional layer are divided into two groups: the subset A to be retained and the subset to be deleted, containing S and T channels, respectively.

Linear relations
For any input image X ∈ ℝ^{M×N} and any convolution kernel (channel) w ∈ ℝ^{U×V}, the feature mapping Y ∈ ℝ^{M×N} is obtained as Y = w ⊛ X (⊛ denotes the convolution operation). If X and Y are regarded as random variables, their expectations are μ and μ′, respectively. In fact, the convolution operation scales the expectation of the image: μ′ = δ_E · μ, where δ_E is the expectation scaling factor, Ψ(·) denotes the summation of matrix elements, and there is a linear relationship between δ_E and Ψ(w). In visual inspection, different convolution kernels implement a variety of operations on the image; convolutional operations scale the distributional expectation. The distribution expectation reflects the overall information of the feature mapping, so the expectation scaling factor δ_E serves as a basic feature of the channel, reflecting its ability to extract features.
In this article, a randomized convolutional kernel experiment is used to validate the relationship between the δ_E of data samples and the Ψ(w) of channels. The randomly selected convolutional kernels have different Ψ(w), and normalized images from the CIFAR-10 dataset are used as input. As illustrated in Fig. 2, the Ψ(w) of a channel directly determines the change in the expectation of the distribution, i.e., the value of δ_E, for any data sample. This indicates that δ_E and Ψ(w) follow an approximately linear relationship, which provides a theoretical guide to the selection of redundant channels.
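The relation above can be checked numerically with a minimal sketch (not the paper's experimental code; the input here is a synthetic image rather than a CIFAR-10 sample): convolve a fixed image with several random kernels and compare the ratio of means δ_E against the kernel element sum Ψ(w).

```python
import numpy as np

def conv2d_valid(x, w):
    # Naive 'valid' 2-D convolution (cross-correlation suffices for this check).
    U, V = w.shape
    M, N = x.shape
    out = np.empty((M - U + 1, N - V + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + U, j:j + V] * w)
    return out

rng = np.random.default_rng(0)
x = rng.random((64, 64)) + 0.5          # stand-in for a normalized image, mean ~1
for _ in range(5):
    w = rng.normal(size=(3, 3))         # random kernel with its own Psi(w)
    y = conv2d_valid(x, w)
    delta_E = y.mean() / x.mean()       # expectation scaling factor delta_E
    # delta_E tracks Psi(w) = sum of kernel elements, up to border effects
    print(f"Psi(w) = {w.sum():+.4f}, delta_E = {delta_E:+.4f}")
```

Each printed pair should nearly coincide, which is exactly the linear (in fact near-identity) relation between δ_E and Ψ(w) that Fig. 2 illustrates on real images.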

Selecting redundant channels
Channels with similar Ψ(w) generate similar δ_E on different data samples. This article argues that similar δ_E is a cause of network redundancy, so such channels need to be sparsified to reduce similar underlying features. The pruning process should also keep δ_E balanced across scales to maintain the rich feature-extraction capability of the network.
The pruning strategy in this article aims to identify and sparsify the redundant set of channels in W_Ci. In order to effectively retain the features extracted by the original network at different scales, all channels are first divided into several scales, and the channels at each scale are randomly pruned at the same pruning rate. Uniformly sparsifying the δ_E of a convolutional layer makes the channel weight sums Ψ(A) of the retained channel set A follow the same distribution as the channel weight sums Ψ(W_Ci) of the original channel set W_Ci, so that the accuracy and generalization performance of the pruned model can be maintained. This article implements pruning from two perspectives, i.e., global pruning and local pruning.
Local pruning.For local pruning (called EXP-A), different filters have divided the set W C i .Let the number of filters be M, then where, P m denotes the m-th filter, 1 m M. Assuming that the class of basic features (non-redundant features) present in the filter P m is N BF m , then pruning is to remove the redundant channels of each basic feature in P m .In the model, a similarity evaluation function S index ðÁÞ is required for the judgment of similarity ÉðwÞ, so as to effectively characterize the redundancy of a certain set of channels D as follows.
where d_i is the i-th channel in the channel set D and round(·) denotes rounding. N regulates the granularity of similarity: the larger N is, the more similarity classes (basic feature types N^BF) the set D is divided into, and the stricter the condition for judging similarity. Equation (4) essentially divides the channels into N + 1 classes, with each class representing one basic feature, which determines the number of basic feature types contained in D. Using the similarity evaluation function S_index(·), we can count the number of redundant channels contained in each class of basic features in P_m. Let P^{n_BF}_m denote the set of channels of the n_BF-th class of basic features in P_m, 1 ≤ n_BF ≤ N^BF_m, and let w^{r′}_{m,n_BF} denote the r′-th channel in P^{n_BF}_m, 1 ≤ r′ ≤ R′. The redundancy of P^{n_BF}_m is then defined in Eq. (5). Redundant channels can therefore be randomly pruned from P^{n_BF}_m according to a set compression rate α (0 ≤ α ≤ 1), with the percentage of deleted channels given by Eq. (7). The final pruning process in the EXP-A method is as follows: (i) set the channel granularity factor N and the pruning rate α; (ii) assign all channels in P_m to the N + 1 basic feature subsets one by one using Eq. (4), and sort the subsets by their number of elements, the largest subset being P^{n_BF}_{m,1} and the smallest P^{n_BF}_{m,N+1}; (iii) calculate the redundancy Re(P^{n_BF}_{m,j}) and the deletion ratio q_del(P^{n_BF}_{m,j}) of each subset in turn from Eqs. (5) and (7), and randomly delete channels from each subset according to q_del(P^{n_BF}_{m,j}). The deletion proceeds as follows: each channel in P^{n_BF}_{m,j} is examined in turn, a random number rand in [0, 1] is generated at each examination, and if rand ≤ q_del(P^{n_BF}_{m,j}) the channel is deleted; otherwise it is retained.
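The per-filter procedure above can be sketched in a few lines (a hedged illustration only: `s_index`, the keep-one-representative rule, and the constant deletion probability `alpha` are simplified stand-ins for Eqs. (4), (5), and (7), whose exact forms are given in the paper):

```python
import numpy as np

def s_index(channels, n_levels):
    # Simplified similarity index: map each channel's element sum Psi(w)
    # to an integer class in [-N/2, N/2], giving N + 1 basic-feature classes.
    sums = np.array([c.sum() for c in channels])
    scale = max(np.abs(sums).max(), 1e-12)
    return np.round(n_levels / 2 * sums / scale).astype(int)

def exp_a_prune(filter_channels, n_levels=10, alpha=0.5, rng=None):
    # Randomly drop channels inside each similarity class, keeping at least
    # one representative per class (a sketch of the EXP-A deletion loop).
    if rng is None:
        rng = np.random.default_rng()
    classes = s_index(filter_channels, n_levels)
    keep = np.ones(len(filter_channels), dtype=bool)
    for cls in np.unique(classes):
        idx = np.flatnonzero(classes == cls)
        if len(idx) <= 1:
            continue                      # a lone channel is not redundant
        for i in idx:
            if rng.random() < alpha:      # delete with probability ~ q_del
                keep[i] = False
        if not keep[idx].any():           # preserve one basic-feature exemplar
            keep[rng.choice(idx)] = True
    return keep

# Toy filter with 64 random 3x3 channels
channels = [np.random.default_rng(i).normal(size=(3, 3)) for i in range(64)]
mask = exp_a_prune(channels, n_levels=10, alpha=0.5,
                   rng=np.random.default_rng(0))
print(f"kept {mask.sum()} of {len(channels)} channels")
```

EXP-B would apply the same routine once to the whole channel set W_Ci of a layer instead of per filter P_m.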
This article notes that the abstraction in CNNs from low-level features to higher-order features is directed toward the same semantic features. This means that the number of basic feature types at each scale in each convolutional layer should be the same, i.e., the channel set W_Ci contains the same number of basic features at each scale as its channel subsets P_m. Global pruning. Global pruning (called EXP-B) performs random pruning over all channels W_Ci in a convolutional layer. From "Non-uniqueness and Stability of Sub-networks", it can be seen that Ψ(w) in each convolutional layer obeys a Gaussian distribution. In order to effectively retain the basic features of the channel set at different scales, Ψ(w) is uniformly partitioned with the help of Eq. (4). Meanwhile, the similarity evaluation function maps the channel weight sums Ψ(w) to integers in the interval [−N/2, N/2], i.e., S_index(·) uniformly classifies the Ψ(w) scales. Pruning with the similarity evaluation function S_index(·) thus includes a grading of scales into N + 1 levels, i.e., Ψ(W_Ci) can be regarded as consisting of N + 1 sets. Compared with EXP-A, the EXP-B pruning method therefore only replaces the filter P_m with the channel set W_Ci in step (ii).
After random sparsification, the set of channels with similar δ_E retains only some of the basic features, thus achieving random channel pruning of CNNs layer by layer. This differs from pruning methods that use data samples or adaptive pruning, and it saves much time in the selection of channels. Random pruning also means that whatever changes the channels make to the distribution expectations of the data samples, pruning is completed simply by removing channels with similar δ_E.

EXPERIMENT

Experimental settings
This article validates the proposed method on the CIFAR-10 (Torralba, Fergus & Freeman, 2008) and ImageNet ILSVRC2012 (Russakovsky et al., 2015) datasets, comparing its efficiency in reducing model complexity with that of other methods. It is also tested on networks with different structures, including VGGNet (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), and GoogLeNet (Szegedy et al., 2015). The complexity and performance of the models are evaluated using floating-point operations (FLOPs) and Top-1 accuracy. All experiments were trained and tested on an NVIDIA Tesla P40 graphics card using the PyTorch (Paszke et al., 2017) framework.
The stochastic gradient descent (SGD) algorithm was adopted to solve the optimization problem. The training batch size was set to 128, the weight decay to 0.0005, and the momentum to 0.9. On the CIFAR-10 dataset, fine-tuning was performed for 30 epochs; the initial learning rate was 0.01 and was divided by 10 at epochs 15 and 30. On the ImageNet dataset, fine-tuning was performed for 20 epochs; the initial learning rate was 0.001 and was divided by 10 at epochs 15 and 25.
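The decay rule can be written as a small helper (a sketch of the stated schedule only; the function name and structure are ours, not from the paper's released code):

```python
def finetune_lr(epoch, dataset="cifar10"):
    # Learning-rate schedule from the paper's fine-tuning settings:
    # CIFAR-10: lr0 = 0.01, divided by 10 at epochs 15 and 30;
    # ImageNet: lr0 = 0.001, divided by 10 at epochs 15 and 25.
    lr0, milestones = ((0.01, (15, 30)) if dataset == "cifar10"
                       else (0.001, (15, 25)))
    return lr0 / (10 ** sum(epoch >= m for m in milestones))
```

In PyTorch the same schedule would typically be expressed with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 30], gamma=0.1)` on top of `torch.optim.SGD(..., lr=0.01, momentum=0.9, weight_decay=5e-4)`.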
VGG-16.The EXP method maintains a high accuracy despite the large reduction in FLOPs.EXP-B achieves a 60.8% reduction in FLOPs with a 0.23% reduction in accuracy compared to the baseline model.In contrast, the SENS (Yang & Liu, 2022) based on the model parameter criteria, which uses 2nd-order sensitivity to remove insensitive filters, only results in a 54.1% reduction in FLOPs while decreasing accuracy by 0.53%.When achieving a 70.89% compression of FLOPs, the EXP-B leads to an accuracy reduction of only 0.47%.
ResNet-56/110.In ResNet-56, EXP-B reduces the FLOPs by 60.5%, and the accuracy increases by 0.37% compared to the baseline model.When larger compression is achieved, the FLOPs decrease by 71.9%, and the accuracy decreases by only 0.23%.Compared to the feature selection-based HRank (Lin et al., 2020a), the EXP can effectively compress the model while maintaining stable performance.SRR-GR (Wang, Li & Wang, 2021) removes the most redundant filters from the network by calculating the filter redundancy score, however, the method uses L2 norm as the criterion for filter redundancy, and therefore the pruning results have mediocre performance.In ResNet-110, EXP-B also showed better compression performance than FPC (Chen et al., 2022), with a 70.0%reduction in FLOPs and a 0.28% improvement in accuracy over the baseline model.
GoogLeNet.The results show that EXP-A can reduce the FLOPs by 70.4% with only a 0.03% decrease in accuracy.EPruner (Lin et al., 2021) used the transfer algorithm Affinity Propagation for calculating an adaptive number of samples, which then act as a preserved filter.The EXP method achieves comparable performance to this method, but the pruning method is simpler and easier to implement.
ResNet-50 has more parameters than ResNet-56, and to clearly distinguish the classes to which the parameters belong, the value of N in Eq. (4) is set to 15. EXP-B attains a Top-1 accuracy of 75.76% when FLOPs are reduced by 53.1%. SCOP-B (Tang et al., 2020), based on scientific control, achieves a similar compression of FLOPs while reducing accuracy by 0.89%. WB (Zhang et al., 2022), based on feature selection, preserves the channels that contribute to most categories by visualizing feature mappings, and achieves a 63.5% FLOPs reduction with a 1.94% accuracy reduction. In contrast, EXP-B incurs only a 1.13% decrease in Top-1 accuracy when achieving a 60.6% decrease in FLOPs, from which it can be concluded that the model-parameter-based method adapts well to different datasets.

Discussion
EXP is based on model parameters and investigates the effect of randomly removing channels with similar δ_E on the results. Compared to other pruning methods based on model parameters (Yu et al., 2018; Zhao et al., 2019; Yang & Liu, 2022; Lin et al., 2021; Wang et al., 2023), EXP is simpler and more effective: it calculates the sum of weights of the convolutional kernels, uses a similarity evaluation function to determine redundancy classes, and finally randomly sparsifies the set of redundant features to obtain non-unique sub-networks. In traditional pruning work, the norm is commonly used to analyze the importance of convolutional kernels, while Wang et al. (2023) analyzed learning schedules and learning rate decay rules to reassess the effectiveness of the L1-norm for filter pruning. On the ImageNet ILSVRC2012 dataset, the EXP-A method slightly underperforms the L1-norm results, while EXP-B improves Top-1 accuracy by 0.25% in comparison; this article thus provides a new perspective for analyzing the redundancy properties of convolutional kernels. The feature-selection-based methods (Zhang et al., 2022; Chen et al., 2022) show a larger loss because the selected sub-networks are more biased toward part of the dataset, and the sub-dataset used for feature selection has a distribution bias relative to the original dataset. NPPM (Gao et al., 2021) and SCOP (Tang et al., 2020), on the other hand, introduce additional operations to determine the pruning decision, consuming more computational resources. The EXP method, by contrast, is independent of the dataset and formulates pruning decisions more simply.
Comparing the results of local pruning (EXP-A) and global pruning (EXP-B) from various aspects, EXP-B shows a better balance. In terms of parameter connectivity, EXP-B has more opportunities to retain good model connectivity and is more flexible than EXP-A, since a retained channel is not restricted to connect to a particular filter. In terms of computation, while EXP-A computes the redundancy in the filters one by one, EXP-B computes the global redundancy directly, saving computation time.
In terms of the distribution of the sum of convolutional kernel weights, the EXP-A method has a smaller range for redundancy discrimination, while the distribution of the subnetworks obtained by EXP-B, which involves global discrimination, is closer to the original distribution.
Non-uniqueness and stability of sub-networks

Non-uniqueness
The proposed EXP method generates different sub-networks by randomly pruning the set of redundant channels. Figure 3 shows the distribution of the channel matrix element sums for the 16th convolutional layer of the pre-trained ResNet-56 model and the distributions after random pruning. The original distribution approximately obeys a skewed Gaussian distribution, and after various degrees of compression, the randomly preserved sub-networks still approximately obey a Gaussian distribution. One reason the performance of the sub-network remains stable is that the convolutional layer distribution of the sparsified sub-network matches the original distribution. Pruning by the norm attribute disperses the parameters into two clusters (away from 0), making the parameter distribution discontinuous. In contrast, sparsifying redundant channels yields a smoother parameter distribution that still covers the entire interval, with only different degrees of density reduction.
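The intuition that random removal preserves the shape of the weight-sum distribution can be checked with a toy simulation (synthetic Gaussian Ψ(w) values, not the actual ResNet-56 layer):

```python
import numpy as np

rng = np.random.default_rng(1)
sums = rng.normal(loc=0.0, scale=0.3, size=2000)   # toy Psi(w) values
keep = rng.random(sums.size) > 0.6                 # random ~60% pruning
kept = sums[keep]
# Random thinning keeps the mean and spread close to the original,
# unlike norm-based pruning, which removes everything near zero.
print(f"orig: mean={sums.mean():+.3f}, std={sums.std():.3f}")
print(f"kept: mean={kept.mean():+.3f}, std={kept.std():.3f}")
```

Norm-based pruning would instead delete all values near zero, splitting the kept distribution into two clusters, which is the discontinuity described above.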

Stability
The experimental results in "Results and Analysis" show that random sparsity does not reduce the performance of the retained sub-network. Another reason the sub-network performance remains stable is that redundant channels focus excessively on the features of the data samples, and removing them improves generalization by weakening, rather than ignoring, this focus on features. Table 6 shows the results of multiple random repetition experiments, which indicate that randomly removing redundant channels keeps the pruning results stable. For example, ResNet-56 maintains an accuracy of about 93.60% after a 60.55% reduction in FLOPs, with fluctuations controlled within 0.10%.

Generalization impact
We consider that channels with similar δ_E focus excessively on local features, causing overfitting, and that sparsifying the redundant channels can therefore improve the generalization ability of the model. Taking the ResNet-56 model as an example, experiments were conducted using EXP-A and EXP-B at different compression rates, with results shown in Fig. 4A. The accuracy always remains above the baseline level at pruning rates of 0.53 and 0.60, and the continued removal of redundant channels continues to improve the generalization ability of the model, allowing the accuracy to improve further. Previous works (Bartoldson, Barbu & Erlebacher, 2018; Bartoldson et al., 2020) have shown that pruning the later layers is sufficient to improve the generalization ability of the model. However, to achieve high compression rates, compressing the earlier layers too little and the later layers too much decreases the generalization ability of the model. Figure 4B shows that setting a larger pruning rate for the later layers destroys the high-level features of the model and makes network performance degrade rapidly; the generalization ability then cannot be effectively improved, and model accuracy tends to decrease. The pruning rate should therefore be set as globally smooth as possible.

CONCLUSION
Based on the discovery of a linear relationship between the sum of matrix elements of a channel and the expectation scaling factor of the pixel distribution, this article proposes a new structured pruning method called EXP. The method uses this linear relationship to group channels with similar δ_E into a redundant channel set. By randomly pruning this set, the excessive focus of redundant channels on local features can be weakened, yielding non-redundant and non-unique sub-networks. Extensive experiments and analysis verify the effectiveness of the proposed EXP method. Future work will focus on the generalization effects caused by sparsity to further optimize DNNs.
Tiehua Ma conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Figure 1: Random sparsity of channels using the δ_E of the channels. The channels are categorized using δ_E as an indicator, and the linear relationship of δ_E is used to achieve stochastic sparsity of the channels. Full-size DOI: 10.7717/peerj-cs.1564/fig-1

Figure 3: (A-F) Distribution of the sum of matrix elements Ψ(W_Ci) of the channels in the 16th convolutional layer of the pre-trained and pruned ResNet-56 networks. The red lines indicate the locations of specific values of the sum of elements. The three percentage values correspond to the percentage of values between −0.1 and 0.1, the positive percentage (P), and the negative percentage (N). Pr denotes the pruning rate. Full-size DOI: 10.7717/peerj-cs.1564/fig-3

Figure 4: Performance of the ResNet-56 network at different compression rates. Pr denotes the pruning rate and Conv-Layer denotes the convolutional layer. Full-size DOI: 10.7717/peerj-cs.1564/fig-4

Table 1
Related work summary.