Simplify: a Python library for optimizing pruned neural networks

Neural network pruning allows for an impressive theoretical reduction of model size and complexity. In practice, however, it usually offers little benefit, as it is most often limited to zeroing out weights without actually removing the pruned parameters, precluding the actual advantages promised by sparsification methods. We propose Simplify, a PyTorch-compatible library for achieving effective model simplification. Simplified models benefit from both a smaller memory footprint and a lower inference time, making their deployment to embedded or mobile devices much more efficient.


Motivation and significance
Over the last few years, neural network pruning (i.e. the reduction of the size and complexity of a model through the removal of a set of parameters) has been the subject of extensive research in the scientific community [3,6,18,19,22].
Modern pruning techniques allow for impressive theoretical reductions in both memory requirements and inference time for state-of-the-art neural network architectures. However, most procedures are limited to identifying which portion of the weights can be set to zero, offering little to no practical advantage when the model is deployed to resource-constrained devices such as mobile phones or embedded systems. While most pruning-related works report some form of theoretical speedup, either in terms of FLOPs or inference speed [1], this does not always reflect the actually achievable performance gain, which is usually overestimated.
To solve this issue, we propose Simplify, a PyTorch [14] compatible simplification library that allows obtaining an actually smaller model in which the pruned neurons are removed and no longer weigh on the size and inference time of the network. This technique can be used to correctly evaluate the actual impact of a pruning procedure when applied to a given network architecture. Moreover, Simplify allows applying the simplification process even at training time, in conjunction with pruning techniques, thus reducing the time required for pruning and fine-tuning neural networks. A high-level representation of the pruning and simplification pipeline is given in Figure 1.

In the related literature it is possible to encounter two classes of pruning procedures: unstructured and structured. Unstructured pruning approaches remove single parameters from the network, independently from one another [2,4,9,13,20,21]. When employing these techniques, one can obtain a high degree of sparsity, but the pruning of entire neurons is not guaranteed. Structured approaches, on the other hand, focus on the removal of whole neurons, leading to the imposition of some kind of structure over the pruned topology [10,19,23]. Since our proposed library removes the pruned neurons from the network, we focus on models pruned using structured techniques.

Various accelerators for sparse neural networks, both hardware and software, have been proposed [11,15,25,26]. The main downside of these solutions is the requirement for specific hardware or software, which can hardly be applied to standard consumer devices. Furthermore, they are designed to apply inference-time acceleration using the zero-filled model instead of building an optimized structure, thus precluding the ability to train a pruned neural network.
Simplify solves these issues by extracting the remaining structure from a pruned model and removing all the zeroed-out neurons from the network. This allows obtaining a model that can be saved, shared and used without any special hardware or software. While at first glance this may seem a straightforward procedure, the removal of zeroed neurons poses some hidden challenges, such as the presence of biases in said neurons or constraints on the outputs' dimensions due to skip or residual connections. Even though the interest of the deep learning community in the matter seems to be quite strong, very few approaches and libraries for simplifying pruned models have been proposed. Moreover, they are usually limited to simpler architectures such as VGG [16], and their usage is restricted to the deployment of an already pruned model. With Simplify, on the other hand, we provide a way to:

1. Optimize more complex network architectures (e.g. ResNet [5], DenseNet [7] and so on) and, in general, custom architectures, without constraints given by the connectivity patterns (i.e. residual connections);
2. Optimize models during training: this allows obtaining speed-ups in the time required for training a model and reducing the memory occupation, when applied together with an iterative pruning technique.

Software Description
The Simplify library builds on the main PyTorch packages and is composed of three main modules that, even if designed to function in a predefined order, can be used independently based on the user's requirements. We now provide a brief overview of each module's functionality and purpose. A more detailed explanation of the math involved in each module is provided in the Appendix.
Fuse First, we have the fuse module, which performs a non-mandatory optimization of the model by merging pairs of consecutive Convolutional and Batch Normalization layers into a single Convolutional layer. This process is known as Batch Normalization fusion, or folding. The step can be skipped if the presence of Batch Normalization layers in the network is required, e.g. for further training of the simplified model. It is not needed to define the simplified model, but provides inference-time and memory-usage advantages, especially when deploying a trained model to production, thanks to an optimization of the model architecture.
Propagate The second module is called propagate. With this module we solve the problem, mentioned in Sec. 1, of non-zero biases in zeroed neurons. Some pruned neurons may retain a non-zero bias; in such a situation it would be impossible to remove the neuron without losing the bias contribution. To solve this, the propagate module treats such neurons as a constant signal that is absorbed by the biases of the next layer, making the zeroed neurons removable.
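The idea behind this step can be sketched for two fully-connected layers as follows. This is a standalone illustration of the bias-propagation principle, assuming a ReLU activation between the two layers; the function name is ours and this is not the library's internal API:

```python
import torch
import torch.nn as nn

def propagate_bias_linear(l1: nn.Linear, l2: nn.Linear) -> None:
    """Fold the constant output of zeroed-out neurons of l1 into the
    bias of l2, assuming a ReLU activation between the two layers.
    Illustrative sketch only, not the library's internal API."""
    pruned = l1.weight.abs().sum(dim=1) == 0          # fully zeroed rows
    with torch.no_grad():
        const = torch.relu(l1.bias) * pruned          # constant neuron outputs
        l2.bias += l2.weight @ const                  # absorb into next bias
        l1.bias[pruned] = 0.0                         # neurons now removable

# The network output is unchanged by the propagation:
torch.manual_seed(0)
l1, l2 = nn.Linear(4, 5), nn.Linear(5, 3)
with torch.no_grad():
    l1.weight[1].zero_(); l1.weight[3].zero_()        # prune neurons 1 and 3
x = torch.randn(2, 4)
before = l2(torch.relu(l1(x)))
propagate_bias_linear(l1, l2)
after = l2(torch.relu(l1(x)))
assert torch.allclose(before, after, atol=1e-6)
```

After this step the pruned neurons output exactly zero, so they can be physically removed without changing the network's function.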
Remove Lastly, with the remove module we perform the actual simplification of the model, removing the zeroed-out neurons. Here we make sure that the output and input dimensions of adjacent layers correspond, while also taking into account architecture constraints such as the presence of skip connections.
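For the same two-Linear-layer setting, the removal step can be sketched as follows. Again, this is an illustration rather than the library's code, and it assumes the biases of the pruned neurons have already been set to zero by the propagate step:

```python
import torch
import torch.nn as nn

def remove_zeroed_linear(l1: nn.Linear, l2: nn.Linear):
    """Drop the zeroed output neurons of l1 together with the matching
    input columns of l2. Assumes the biases of the pruned neurons have
    already been set to zero by the propagate step. Sketch only."""
    keep = (l1.weight.abs().sum(dim=1) != 0).nonzero().flatten()
    s1 = nn.Linear(l1.in_features, len(keep))
    s2 = nn.Linear(len(keep), l2.out_features)
    with torch.no_grad():
        s1.weight.copy_(l1.weight[keep]); s1.bias.copy_(l1.bias[keep])
        s2.weight.copy_(l2.weight[:, keep]); s2.bias.copy_(l2.bias)
    return s1, s2

torch.manual_seed(0)
l1, l2 = nn.Linear(4, 6), nn.Linear(6, 2)
with torch.no_grad():
    l1.weight[2].zero_(); l1.bias[2] = 0.0            # pruned, bias propagated
s1, s2 = remove_zeroed_linear(l1, l2)
x = torch.randn(3, 4)
assert s1.out_features == 5                           # one neuron removed
assert torch.allclose(l2(torch.relu(l1(x))), s2(torch.relu(s1(x))), atol=1e-6)
```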

Illustrative Examples
In this section we provide a usage overview of Simplify. We also illustrate the results obtained for the two different use cases discussed in Sec. 1, namely optimization for model deployment and optimization during training.
Optimization for deployment This is the most common use case: the simplification procedure is applied to an already trained model, on which a pruning criterion has been previously applied. In most cases, a one-line call to the simplify method is sufficient: the library performs all three steps autonomously and takes care of different architectural patterns such as residual connections. Below, we provide a sample code snippet. Tab. 1 shows the inference times (in milliseconds) of different standard PyTorch dense models, of the resulting pruned models (random structured pruning with 50% probability) and of the simplified models obtained with our proposed library. The benchmarks are run on an Intel(R) Core(TM) i9-9900K CPU, with a batch size of 1 in order to simulate one-shot inference of a deployed model. The results are averaged across 1000 runs for each architecture. It is easy to see that, thanks to Simplify, the resulting model is actually faster and able to leverage the applied pruning while remaining a fully-fledged PyTorch network. Additional results for all the torchvision architectures can be found in the repository README file.

Optimization for training Most modern network architectures employ Batch Normalization as a way to improve generalization. To avoid losing the Batch Normalization contribution, we provide the ability to skip the fusion step, so that these layers are retained. To further improve training time, it is possible to enable a training mode for simplify, which helps in decreasing inference time. More details are provided in Sec. C.1.
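A minimal deployment example might look as follows. The structured pruning uses the standard torch.nn.utils.prune API; the simplify import path and call signature are a sketch based on the repository README, so please refer to the README for the up-to-date API:

```python
import torch
from torch.nn.utils import prune
from torchvision.models import resnet18

# NOTE: import path and call signature follow the repository README at the
# time of writing; please refer to the README for the up-to-date API.
from simplify import simplify

model = resnet18(pretrained=True).eval()

# Random structured pruning: zero out 50% of the output channels of
# every convolutional layer, then make the pruning permanent.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.random_structured(module, "weight", amount=0.5, dim=0)
        prune.remove(module, "weight")

# One-line simplification: fuse, propagate and remove are applied in order.
simplify(model, torch.zeros(1, 3, 224, 224))
```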

Impact
Current SOTA pruning research bases its results on theoretical estimations of model improvement. These offer poor practical benefits due to the lack of removal of pruned neurons, which still weigh on the model's computation, especially when deployed to resource-constrained devices such as mobile phones. Simplify provides out-of-the-box functionality to translate the impressive theoretical results of pruning procedures into an actual shrinking of the neural network model, reducing both memory requirements and inference time. It allows for a more precise evaluation of pruning procedures, enabling systematic comparison within scientific research, and helps during deployment, allowing full exploitation of the pruned network without the need for ad hoc hardware platforms.

Conclusions
We propose the PyTorch-compatible library Simplify, with the aim of providing a simple-to-use set of procedures to remove zeroed neurons from a neural network architecture. The proposed library solves the different issues that arise in the creation of simplified models, such as the propagation of the biases of pruned neurons and the shape constraints imposed by skip connections.
The proposed library is composed of three modules that, while designed to work together, can be used independently from one another according to the required functionality for a specific setting.

Conflict of Interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

CRediT authorship contribution statement
Andrea Bragagnolo: Conceptualization of this study, Methodology, Software. Carlo Alberto Barbano: Conceptualization of this study, Methodology, Software.

A. Batch Normalization fusion
Many modern neural networks use Batch Normalization (from here on, BatchNorm) as a way to improve generalization. Given an input $x$, we can define the output of BatchNorm as

$$y = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \tag{1}$$

where $\gamma$ and $\beta$ represent, respectively, the weights and bias of the layer and are learned using standard backpropagation procedures; $\mu$ and $\sigma^2$ represent the mean and variance computed over a batch, and $\epsilon$ is a small constant added for numerical stability. During training, this layer keeps running estimates of its computed mean and variance, which are then used for normalization during evaluation. Let us denote these approximations as $\hat{\mu}$ and $\hat{\sigma}^2$. Notice that each parameter is defined per channel of the input feature map; we will denote them as $\gamma_c$, $\beta_c$, $\hat{\mu}_c$ and $\hat{\sigma}^2_c$ for a given channel $c$.

Once a neural network is trained to completion, all of its parameters can be considered frozen, i.e. no longer updated by further training. Moreover, in standard network architectures it is possible to identify pairs of Convolutional and BatchNorm layers whose outputs have the same size. In such conditions, it is possible to reduce the network complexity by fusing the two layers into a single one. Note that this operation is only applicable if there is no non-linearity between the two layers.
Let us consider a generic BatchNorm output at evaluation time,

$$y_c = \gamma_c \frac{x_c - \hat{\mu}_c}{\sqrt{\hat{\sigma}^2_c + \epsilon}} + \beta_c; \tag{2}$$

this can be rewritten as

$$y_c = \frac{\gamma_c}{\sqrt{\hat{\sigma}^2_c + \epsilon}} x_c + \beta_c - \frac{\gamma_c \hat{\mu}_c}{\sqrt{\hat{\sigma}^2_c + \epsilon}}. \tag{3}$$

Since this BatchNorm layer is preceded by a Convolutional layer, $x_c$ can be defined as

$$x_c = w_c * z + b_c, \tag{4}$$

where $z$ is the input of the Convolutional layer, $w_c$ are its weights and $b_c$ its bias for channel $c$. We can now express the BatchNorm output as a function of the Convolutional layer by substituting Eq. 4 into Eq. 3:

$$y_c = \frac{\gamma_c}{\sqrt{\hat{\sigma}^2_c + \epsilon}} (w_c * z + b_c) + \beta_c - \frac{\gamma_c \hat{\mu}_c}{\sqrt{\hat{\sigma}^2_c + \epsilon}}. \tag{5}$$
Leveraging Eq. 5, we can finally fuse the Convolutional and the BatchNorm layers into a single Convolutional layer whose weights and bias are defined as

$$w'_c = \frac{\gamma_c}{\sqrt{\hat{\sigma}^2_c + \epsilon}} w_c \tag{6}$$

$$b'_c = \frac{\gamma_c (b_c - \hat{\mu}_c)}{\sqrt{\hat{\sigma}^2_c + \epsilon}} + \beta_c, \tag{7}$$

and whose output is therefore

$$y_c = w'_c * z + b'_c. \tag{8}$$
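The fusion above can be verified numerically with a short, self-contained PyTorch sketch of standard Conv-BatchNorm folding. The function name is ours; this is not the library's fuse implementation:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an evaluation-mode BatchNorm into the preceding convolution,
    following the fused weights and bias derived above. Standalone sketch,
    not the library's fuse implementation."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    # scale_c = gamma_c / sqrt(var_c + eps), one value per output channel
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Sanity check against Conv -> BatchNorm in evaluation mode
torch.manual_seed(0)
conv, bn = nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)
with torch.no_grad():
    bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5);    bn.bias.uniform_(-1, 1)
bn.eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-4)
```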

B. Bias propagation
This step is necessary if biases are present in the model's hidden layers, or are introduced by the fusion of Batch Normalization layers. Neurons with zeroed-out weight channels might have a non-zero bias, and so they will fire a constant output value. Hence, a neuron cannot be immediately removed if its bias is non-zero. These values, however, can be propagated and accumulated into the biases of the next layer. This operation can be repeated until all of the biases have been propagated to the last layer of the network. After a bias has been propagated, it can be set to zero in the original neuron, which in turn allows the removal of the whole weight channel.

B.1. Linear layers
We denote as $l_1 = \langle W, b \rangle$ and $l_2 = \langle W', b' \rangle$ two sequential linear layers. $W$ and $b$ denote the weight matrix and bias vector of $l_1$, of size $n \times m$ and $n$ respectively; $W'$ and $b'$ denote the weight matrix and bias vector of $l_2$, of size $p \times n$ and $p$ respectively. We also denote as $\sigma$ the activation function (e.g. ReLU). A forward pass for $l_1$ consists in

$$z = Wx + b \tag{9}$$

(where $x$ represents an input vector of size $m$) and for $l_2$

$$y = W'\sigma(z) + b'. \tag{10}$$

We now suppose that some output channel of $l_1$ has been zeroed out following the application of some pruning criterion, e.g. every entry of the second row $W_1$ is zero; the corresponding entry of $z$ then reduces to the constant $b_1$. Focusing on the forward pass of $l_2$, we analyze what happens with the first neuron $y_0$. If we rewrite Eq. 10 focusing on $y_0$ we obtain

$$y_0 = \sigma(z_0)W'_{0,0} + \sigma(b_1)W'_{0,1} + \dots + \sigma(z_{n-1})W'_{0,n-1} + b'_0. \tag{11}$$

The term $\sigma(b_1)W'_{0,1}$ is a constant which can be accumulated into $b'_0$. The same reasoning can be extended to all neurons in $l_2$, by adding $\sigma(b_1)$ multiplied by the respective incoming weight to each neuron's bias. The new set of biases $\hat{b}'$ for the layer can be written as

$$\hat{b}' = \begin{bmatrix} b'_0 + \sigma(b_1)W'_{0,1} \\ b'_1 + \sigma(b_1)W'_{1,1} \\ \vdots \\ b'_{p-1} + \sigma(b_1)W'_{p-1,1} \end{bmatrix}, \tag{12}$$

and the original bias $b_1$ can be set to zero in $l_1$, resulting in $\hat{b} = (b_0, 0, b_2, \dots, b_{n-1})$. This procedure can be applied when multiple neurons are pruned in $l_1$, and the general rule to obtain the updated biases $\hat{b}'$ is

$$\hat{b}'_j = b'_j + \sum_{i \in P} \sigma(b_i)W'_{j,i}, \tag{13}$$

where $P$ represents the indices of the zeroed channels in $l_1$. After the bias propagation procedure, the pruned neurons of $l_1$ no longer contribute to the output and can be removed.

B.2. Convolutional layers
A similar reasoning can be applied for convolutional layers. However, the propagation process needs to take into account whether the convolution employs zero-padding on the input tensor or not.
For the sake of simplicity, using the same notation as Section B.1, let us consider two sequential convolutional layers $l_1 = \langle W, b \rangle$ and $l_2 = \langle W', b' \rangle$. We also assume that $l_1$ has one input channel and two output channels ($W$ has shape $2 \times 1 \times 1 \times 1$ and $b$ is a vector of length 2), while $l_2$ has two input channels and one output channel ($W'$ has shape $1 \times 2 \times 2 \times 2$, and $b'$ is a vector of length 1).
The forward pass for $l_1$ is

$$Z = W * x + b, \tag{14}$$

where $*$ represents the convolution operation and $x$ is a properly sized input. In this context, the addition between the resulting feature map $A = W * x$ and the corresponding bias value performs a shape expansion: each scalar $b_c$ is broadcast over the whole $c$-th output channel. We now assume that the second channel $W_1$ of $l_1$ has been zeroed out after the application of some pruning criterion; hence, considering $A_1 + b_1$, we obtain a constant feature map in which every entry equals the expanded (broadcast) bias $b_1$.
We now analyze what happens in $l_2$. For the sake of simplicity, we assume that the input spatial size is $3 \times 3$ and that every value of $W'$ is equal to 1; we also consider a stride of 1 for $l_2$.

Convolution without padding ("valid" padding): This is the simpler case, and it is similar to the linear layers (Section B.1). The forward pass of $l_2$ can be expressed as

$$y_0 = W'_{0,0} * \sigma(Z_0) + W'_{0,1} * \sigma(b_1) + b'_0, \tag{15}$$

where $\sigma(b_1)$ denotes the constant $3 \times 3$ map produced by the pruned channel. The factor $W'_{0,1} * \sigma(b_1)$ is constant and can be accumulated into $b'_0$: since every entry of the $2 \times 2$ kernel $W'_{0,1}$ is 1, each output location of $W'_{0,1} * \sigma(b_1)$ equals $4\sigma(b_1)$, so the resulting map is constant. We can therefore directly factor out the scalar $4\sigma(b_1)$ and set $b_1$ to 0 in $l_1$, obtaining a new bias $\hat{b}'_0 = 4\sigma(b_1) + b'_0$ which will be used from now on in $l_2$. The same reasoning can be extended to the case of multiple neurons in the convolutional layer and multiple pruned channels in the preceding layer. The general rule to obtain the new bias vector $\hat{b}'$ can be expressed as

$$\hat{b}'_j = b'_j + \sum_{i \in P} \sigma(b_i) \sum_{u,v} W'_{j,i,u,v}, \tag{16}$$

where $P$ represents the indices of the zeroed output channels in $l_1$.
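The valid-padding case can be checked numerically with a small PyTorch sketch (our own illustration, with a ReLU between the two convolutions; not the library's propagate code):

```python
import torch
import torch.nn as nn

# c1 has one input and two output channels; channel 1 is pruned (zero
# weights) but keeps a non-zero bias. c2 uses no padding ("valid").
torch.manual_seed(0)
c1 = nn.Conv2d(1, 2, kernel_size=1)
c2 = nn.Conv2d(2, 1, kernel_size=2)
with torch.no_grad():
    c1.weight[1].zero_()

x = torch.randn(1, 1, 3, 3)
before = c2(torch.relu(c1(x)))

# The pruned channel outputs the constant relu(b_1); without padding its
# contribution through c2 is the scalar relu(b_1) * sum(W'[:, 1]), which
# is folded into c2's bias (single pruned channel case of the general rule).
with torch.no_grad():
    c2.bias += torch.relu(c1.bias[1]) * c2.weight[:, 1].sum(dim=(1, 2))
    c1.bias[1] = 0.0

after = c2(torch.relu(c1(x)))
assert torch.allclose(before, after, atol=1e-5)
```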

Convolution with zero-padding
If the convolution applies zero-padding to the input, then the bias cannot be accumulated into a scalar, as the resulting map is no longer constant: the padded border contributes zeros instead of $\sigma(b_1)$. To show this, we rewrite the valid-padding expression above, applying a zero-padding of size 1 along each spatial dimension of the input tensor. Writing $b'' = \sigma(b_1)$ for brevity, the new bias values must be maintained in matrix form, i.e.

$$\hat{B}'_0 = \begin{bmatrix} b'' + b'_0 & 2b'' + b'_0 & 2b'' + b'_0 & b'' + b'_0 \\ 2b'' + b'_0 & 4b'' + b'_0 & 4b'' + b'_0 & 2b'' + b'_0 \\ 2b'' + b'_0 & 4b'' + b'_0 & 4b'' + b'_0 & 2b'' + b'_0 \\ b'' + b'_0 & 2b'' + b'_0 & 2b'' + b'_0 & b'' + b'_0 \end{bmatrix}.$$

To obtain the updated biases in the case of multiple neurons and multiple channels, the same general rule used in the no-padding case can be applied, keeping in mind that it will now result in a tensor of shape $p \times h \times w$ instead of a vector. This introduces a constraint on the feature-map size, hence the simplified model can only ever be used at a fixed input size. However, given that the whole simplification procedure is executed on an already trained model, before deployment to production, this should not represent a major issue.
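The non-constant pattern above can be reproduced directly. This sketch isolates the pruned channel's contribution using torch.nn.functional.conv2d, taking $\sigma(b_1) = 1$ and ignoring $b'_0$:

```python
import torch
import torch.nn.functional as F

# Constant map produced by a pruned channel (sigma(b_1) = 1), 3x3 input,
# convolved with a 2x2 kernel of ones and zero-padding of 1.
const_map = torch.ones(1, 1, 3, 3)
kernel = torch.ones(1, 1, 2, 2)
out = F.conv2d(const_map, kernel, padding=1).squeeze()

# Border positions overlap the zero padding, so the map is not constant.
expected = torch.tensor([[1., 2., 2., 1.],
                         [2., 4., 4., 2.],
                         [2., 4., 4., 2.],
                         [1., 2., 2., 1.]])
assert torch.equal(out, expected)
```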

B.3. Residual connections
While the above process works fine for simple feed-forward models, special care must be taken to handle residual connections. As an example, let us consider the case of two linear layers $l_1 = \langle W, b \rangle$ and $l_2 = \langle U, d \rangle$, whose outputs $z = Wx + b$ and $u = Ux' + d$ are summed together in a residual connection, followed by another layer $l_3 = \langle W', b' \rangle$:

$$y = W'\sigma(z + u) + b'. \tag{17}$$

The residual (sum) operation introduces a new constraint: only biases corresponding to matching pruned channels in $l_1$ and $l_2$ can be propagated to the next layer. To see why, we can rewrite Eq. 17 for the first neuron, as done for the linear layers in Section B.1, assuming that the $i$-th channel is pruned in both $l_1$ and $l_2$:

$$y_0 = \sigma(z_0 + u_0)W'_{0,0} + \sigma(z_1 + u_1)W'_{0,1} + \dots + \sigma(b_i + d_i)W'_{0,i} + \dots + \sigma(z_{n-1} + u_{n-1})W'_{0,n-1} + b'_0. \tag{18}$$

It is clear that, even if multiple channels are pruned from $l_1$ and $l_2$ individually, only the factors $\sigma(b_i + d_i)W'_{0,i}$ in which both summands are constant can be folded into the bias. In this case, we opt not to propagate any bias and instead employ an expansion scheme (Sec. C.1) to achieve a speed-up in the convolution operations anyway.