1 Introduction

In recent years, DNNs have achieved great success in various fields, such as information fusion [1,2,3], natural language processing [4,5,6], and computer vision [7]. However, DNNs require large amounts of memory and computational resources, which makes them rely heavily on high-performance computing devices. For example, the VGG-16 [8] model has 140 million parameters and occupies more than 500 MB of storage space. To construct compact DNN models with lower memory and computational cost, it is essential to reduce the number of model parameters while preserving accuracy.

DNN compression methods are divided into five categories according to their compression mechanisms [9]: pruning [10, 11], quantization [12, 13], knowledge distillation [14, 15], low-rank decomposition [16, 17], and network architecture search [18,19,20]. Among these five categories, low-rank decomposition achieves high compression ratios and efficient operations by approximating the original data with a lower-rank representation. Because tensor decomposition offers superior multidimensional representation and higher expressive power than matrix decomposition, this work focuses on tensor decomposition within the low-rank approximation category.

Tensor decomposition, as a mathematical tool for exploring low-rank structure in large-scale tensor data, is a very attractive DNN model compression technique. Many tensor decomposition methods have been studied, such as canonical polyadic (CP) decomposition [21, 22], Tucker decomposition [23, 24], tensor train (TT) decomposition [25, 26], and tensor ring (TR) decomposition [27, 28]. TR decomposition connects its endpoints in a circular structure without imposing a strict multiplication order on the nodes, which offers greater flexibility and enhances the representation capability for high-order tensors. Although TR decomposition has advantages in representation capability and flexibility, its compression performance is determined by the TR ranks.

Using equal TR ranks throughout a TR decomposition leads to unreasonable rank configurations. Existing approaches to rank selection, such as PSTRN [29] and TR-RL [30], rely on iterative processes that introduce a substantial computational burden and significantly extend training time. The PSTRN [29] method progressively searches for TR ranks through a genetic algorithm: it defines an interest region related to the rank and shrinks this region after each iteration to select the optimal rank. However, this iterative approach is time-consuming, with an overall training time of more than 30 h. The TR-RL [30] method selects ranks by reinforcement learning, using the DDPG algorithm to choose the optimal rank of each convolutional layer, but the interaction between environment and agent also consumes considerable time, with an overall training time of about 6 h.

A TR network compression method based on Variational Bayesian approximation (TR-VB) is proposed to address the issues of unreasonable rank configuration and long training time. Inspired by VBMF, TR-VB determines the TR ranks along the input and output channel dimensions through numerical computation, and these ranks then guide the TR decomposition. The proposed TR-VB makes the configuration of TR ranks more reasonable and reduces the time cost. The main contributions of this work can be summarized as follows.

  1.

    A new TR-VB network compression method is proposed to solve the problem of unreasonable rank configuration caused by setting all TR ranks equal. The TR-VB method significantly reduces the number of parameters while maintaining accuracy.

  2.

    The training time cost of TR-VB is reduced significantly by selecting TR ranks through VBMF, since TR-VB searches for TR ranks automatically without manual setting or an iterative approach.

The remainder of this manuscript is organized as follows. Section 2 introduces TR decomposition and VB approximation for DNN compression. Section 3 proposes the TR-VB method. Section 4 analyzes the experimental results. Section 5 concludes the manuscript.

2 Related Works

2.1 Tensor Ring Decomposition

TR decomposition represents a tensor as a series of third-order tensor nodes connected in a circular structure, which removes the strict sequential order required for the multi-linear products of cores in TT decomposition and improves the representation ability and flexibility of TT decomposition [31]. Specifically, let \({\mathcal{W}}\) be a \(d\)th-order tensor of size \(n_{1} \times n_{2} \times \cdots \times n_{d}\), denoted by \({\mathcal{W}} \in {\mathbf{R}}^{{n_{1} \times n_{2} \times \cdots \times n_{d} }}\). The TR representation decomposes it into a sequence of latent tensors \({\mathcal{Z}}_{k} \in R^{{r_{k} \times n_{k} \times r_{k + 1} }} ,k = 1, \cdots ,d\). In index form, the TR decomposition is given by,

$$ {\mathcal{W}}(i_{1} , \cdots ,i_{d} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{d} = 1}}^{{r_{1} \cdots r_{{\text{d}}} }} {\prod\limits_{k = 1}^{d} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } $$
(1)

where \(\left\{ {r_{1} , \cdots ,r_{d + 1} } \right\}\) denote the TR ranks, each node \({\mathcal{Z}}_{k} \in R^{{r_{k} \times n_{k} \times r_{k + 1} }}\) is a 3rd-order tensor, and \(r_{d + 1} = r_{1}\).
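For concreteness, the following NumPy sketch reconstructs a full tensor from its TR cores according to Eq. (1); the tensor sizes, ranks, and random cores are illustrative values, not taken from the paper.

```python
# A minimal NumPy sketch of Eq. (1): contracting TR cores Z_k of shape
# (r_k, n_k, r_{k+1}), with r_{d+1} = r_1, back into the full d-th order tensor.
import numpy as np

def tr_to_tensor(cores):
    """Contract a list of 3rd-order TR cores into the full tensor."""
    full = cores[0]                                   # shape (r_1, n_1, r_2)
    for core in cores[1:]:
        # full: (r_1, n_1, ..., n_k, r_{k+1}); core: (r_{k+1}, n_{k+1}, r_{k+2})
        full = np.tensordot(full, core, axes=([-1], [0]))
    # full: (r_1, n_1, ..., n_d, r_1); close the ring by tracing the rank modes.
    return np.trace(full, axis1=0, axis2=-1)

# Toy example: a 3rd-order tensor of size 4 x 5 x 6 with TR ranks (2, 3, 2, 2).
ranks, dims = [2, 3, 2, 2], [4, 5, 6]                 # ranks[d] = ranks[0]
cores = [np.random.randn(ranks[k], dims[k], ranks[k + 1]) for k in range(3)]
print(tr_to_tensor(cores).shape)                      # (4, 5, 6)
```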

2.2 Variational Bayesian Approximation

The Variational Bayesian (VB) approximation method has been successfully applied to matrix factorization (MF) and offers the advantage of automatically determining the dimensionality of the principal components. However, many existing methods implement the VB approximation through local search algorithms obtained from standard procedures. In contrast, Variational Bayesian Matrix Factorization (VBMF) determines the global solution of the cost function through numerical analysis [32]. Specifically, the global solution of VBMF is obtained from a reweighted singular value decomposition (SVD) of the observed matrix, where each weight is derived by solving a quartic equation whose coefficients are functions of the observed singular values. Unlike standard iterative algorithms, VBMF computes the globally optimal solution without iteration. Additionally, an empirical VB scenario is considered in which the hyperparameters (prior variances) are learned from the data. The global analytic solution (GAS) of empirical VBMF (EVBMF) eliminates the need for manual parameter tuning and facilitates computing the globally optimal solution [32].

3 The Proposed TR-VB

A TR-VB method based on the numerical calculation of the GAS of EVBMF is proposed. Unlike standard TR decomposition with equal TR ranks, TR-VB configures the TR ranks more reasonably, which improves the performance of the TR network. As shown in Fig. 1, TR-VB consists of three parts: rank selection, TR decomposition, and fine-tuning of the compressed model. Firstly, the rank is selected by the GAS of EVBMF after unfolding the 4-D tensor of convolution layer parameters into a 2-D matrix, as shown in the orange box in Fig. 1. Secondly, TR-VB uses TR decomposition to decompose the convolution layer parameters after block processing, as shown in the blue box in Fig. 1. Thirdly, fine-tuning is designed to reduce the performance loss of the compressed model caused by TR decomposition, as shown in the green box in Fig. 1.

Fig. 1

Overview of the tensor ring network compression method based on Variational Bayesian approximation (TR-VB): the orange box denotes rank selection, the blue box TR decomposition, and the green box fine-tuning. The TR-VB method uses the GAS of EVBMF to estimate the ranks of parameter tensors in convolutional layers; the selected ranks are then employed in TR decomposition to compress the redundant parameters within convolutional layers. Finally, the compressed model is fine-tuned on the training set to restore the performance of the network model after TR decomposition, yielding the compressed network model

3.1 Rank Selection

To address the time inefficiency caused by selecting ranks through iteration, TR-VB leverages the GAS of EVBMF to automatically determine the rank for matrix decomposition. The selection process consists of three steps: firstly, unfold the convolution layer parameters from a 4-D tensor into 2-D matrices; secondly, select the rank of the convolution layer parameters; finally, slacken and then retrench the selected rank. The following itemized list explains the specific design details of the TR-VB method.

1. In the tensor ring, the rank of each latent tensor is related to the latent tensor itself. To select the ranks of the latent tensors in the tensor ring, the parameters of the convolutional layers need to be unfolded from 4-D tensors into 2-D matrices.

First, for convenience, the 4-D tensor \({\mathcal{W}}\) of the convolution layer parameters is unfolded into a 3-D tensor \({\mathcal{G}}\) along the filter spatial dimension, as shown in Eq. (2),

$$ {\mathcal{G}} = \left[ {\begin{array}{*{20}c} {{\mathcal{W}}(1,1,:,:)} & {{\mathcal{W}}(1,2,:,:)} & \cdots & {{\mathcal{W}}(1,C_{out} ,:,:)} \\ {{\mathcal{W}}(2,1,:,:)} & {{\mathcal{W}}(2,2,:,:)} & \cdots & {{\mathcal{W}}(2,C_{out} ,:,:)} \\ \vdots & {} & \ddots & \vdots \\ {{\mathcal{W}}(C_{in} ,1,:,:)} & {{\mathcal{W}}(C_{in} ,2,:,:)} & \cdots & {{\mathcal{W}}(C_{in} ,C_{out} ,:,:)} \\ \end{array} } \right] \in {\mathbf{R}}^{{C_{in} \times C_{out} \times (kk)}} $$
(2)

where \({\mathcal{G}}\) denotes the 3-D parameter tensor obtained after unfolding, \({\mathcal{W}}\) denotes the 4-D parameter tensor of each layer before unfolding, and \(C_{in}\), \(C_{out}\) and \(k\) denote the input channel dimension, output channel dimension, and convolution kernel spatial dimension of a convolutional layer, respectively.

Then, because TR-VB performs the decomposition between the input and output channels, the 3-D tensor \({\mathcal{G}}\) is unfolded into the 2-D matrices \({\mathcal{G}}_{{{{in}}}}\) and \({\mathcal{G}}_{{{{out}}}}\), as shown in Eqs. (3) and (4),

$$ {\mathcal{G}}_{{{{in}}}} = \left[ {{\mathcal{G}}(:,:,1),{\mathcal{G}}(:,:,2), \cdots ,{\mathcal{G}}(:,:,n_{3} )} \right] \in R^{{C_{in} \times (C_{out} kk)}} $$
(3)
$$ {\mathcal{G}}_{{{{out}}}} = \left[ {{\mathcal{G}}(:,:,1)^{\text{T}} ,{\mathcal{G}}(:,:,2)^{\text{T}} , \cdots ,{\mathcal{G}}(:,:,n_{3} )^{\text{T}} } \right] \in R^{{C_{out} \times (C_{in} kk)}} $$
(4)

where \({\mathcal{G}}_{{{{in}}}}\) and \({\mathcal{G}}_{{{{out}}}}\) denote the 2-D matrices obtained by unfolding the convolution layer parameter tensor along the input channel dimension and the output channel dimension, respectively.

2. Because the GAS of EVBMF determines ranks by analytic numerical calculation and thus requires little computation time, TR-VB selects the TR ranks \(\left[ {\hat{R}_{{{{in}}}} ,\hat{R}_{{{{out}}}} } \right]\) through this analytical calculation to reduce the parameter redundancy in DNNs, as shown in Eq. (5),

$$ \left[ {\hat{R}_{{{{in}}}} ,\hat{R}_{{{{out}}}} } \right] = g_{{{{evbmf}}}} ({\mathcal{G}}_{{{{in}}}} ,{\mathcal{G}}_{{{{out}}}} ) $$
(5)

where \(\hat{R}_{{{{in}}}}\) and \(\hat{R}_{{{{out}}}}\) denote the ranks along the input channel dimension and output channel dimension, and \(g_{{{{evbmf}}}} ()\) denotes rank estimation of the parameter matrices using the GAS of EVBMF.

3. The selected ranks are slackened to \(\widetilde{R}_{{{{in}}}}\), \(\widetilde{R}_{{{{out}}}}\) to improve the information-carrying capacity of the decomposed tensor. Because some high-dimensional structural information is lost when a 4-D tensor is unfolded into 2-D matrices, the selected ranks must be slackened to enhance the information-bearing capacity of the decomposed tensor. To keep the slack factor between 0 and 1, TR-VB adopts the idea of normalization and obtains the slack ranks via Eqs. (6) and (7),

$$ \widetilde{R}_{{{{in}}}} = \hat{R}_{{{{in}}}} + k_{{{{in}}}} (C_{{{{in}}}} - \hat{R}_{{{{in}}}} ) $$
(6)
$$ \widetilde{R}_{{{{out}}}} = \hat{R}_{{{{out}}}} + k_{{{{out}}}} (C_{{{{out}}}} - \hat{R}_{{{{out}}}} ) $$
(7)

where \(k_{{{{in}}}}\) and \(k_{{{{out}}}}\) denote the slackness coefficients of the tensor rank along the input channel and output channel dimensions, with \(0 < k_{{{{in}}}} < 1\) and \(0 < k_{{{{out}}}} < 1\), and \(\widetilde{R}_{{{{in}}}}\) and \(\widetilde{R}_{{{{out}}}}\) denote the ranks along the input channel and output channel dimensions after slackness.

The excessively slackened ranks are then tightened to \({\text{R}}_{{{{in}}}}\), \({\text{R}}_{{{{out}}}}\) to achieve a reasonable TR rank configuration. A retrenching coefficient is introduced after the rank slackness, as shown in Eqs. (8) and (9),

$$ {\text{R}}_{{{{in}}}} = t_{{{{in}}}} \widetilde{R}_{{{{in}}}} $$
(8)
$$ {\text{R}}_{{{{out}}}} = t_{out} \widetilde{R}_{{{{out}}}} $$
(9)

where \(t_{{{{in}}}}\) and \(t_{out}\) denote the retrenching coefficients of the tensor rank along the input channel and output channel dimensions, with \(0 < t_{{{{in}}}} < 1\) and \(0 < t_{out} < 1\), and \({\text{R}}_{{{{in}}}}\) and \({\text{R}}_{{{{out}}}}\) denote the ranks along the input channel and output channel dimensions after retrenching.
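The sketch below walks through the three rank-selection steps above (Eqs. (2)–(9)) for a single convolutional layer. The GAS of EVBMF itself [32] is not reimplemented here; a simple singular-value energy threshold is substituted purely to show where the rank estimates enter the pipeline, and the slackness/retrenching coefficients and layer sizes are illustrative assumptions.

```python
# Illustrative sketch of Sect. 3.1 for one conv layer. The weight layout
# (C_in, C_out, k, k) follows the paper's indexing in Eq. (2); PyTorch stores
# conv weights as (C_out, C_in, k, k), so a transpose would be needed there.
import numpy as np

def unfold_in_out(W):
    """Eqs. (2)-(4): unfold a (C_in, C_out, k, k) weight into G_in and G_out."""
    C_in, C_out, k, _ = W.shape
    G = W.reshape(C_in, C_out, k * k)                          # Eq. (2)
    G_in = G.reshape(C_in, C_out * k * k)                      # Eq. (3)
    G_out = G.transpose(1, 0, 2).reshape(C_out, C_in * k * k)  # Eq. (4)
    return G_in, G_out

def estimate_rank(G, energy=0.95):
    """Stand-in for g_evbmf() in Eq. (5): smallest rank keeping `energy`
    of the squared singular values. NOT the paper's GAS of EVBMF."""
    s = np.linalg.svd(G, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

def slack_and_retrench(R_hat, C, k_coef, t_coef):
    """Eqs. (6)-(9): slacken toward the channel size, then retrench.
    Rounded to an integer since TR ranks must be integral."""
    R_tilde = R_hat + k_coef * (C - R_hat)
    return int(round(t_coef * R_tilde))

# Toy layer: C_in = 64, C_out = 128, 3x3 kernels; coefficients are examples.
W = np.random.randn(64, 128, 3, 3)
G_in, G_out = unfold_in_out(W)
R_in_hat, R_out_hat = estimate_rank(G_in), estimate_rank(G_out)    # Eq. (5)
R_in = slack_and_retrench(R_in_hat, C=64, k_coef=0.3, t_coef=0.7)  # Eqs. (6), (8)
R_out = slack_and_retrench(R_out_hat, C=128, k_coef=0.3, t_coef=0.7)  # Eqs. (7), (9)
print(R_in_hat, R_out_hat, R_in, R_out)
```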

3.2 Tensor Ring Decomposition

This section introduces the TR decomposition process of the parameters. Let \({\mathcal{W}}_{{{i}}}\) denote the 4-D parameter tensor of the \(i\)-th layer before unfolding; the convolution layer operation is shown in Eq. (10),

$$ {\mathcal{Y}}_{{{i}}} = h({\mathcal{X}}_{{{i}}} ;{\mathcal{W}}_{{{i}}} ) $$
(10)

where \({\mathcal{X}}_{{{i}}}\) and \({\mathcal{Y}}_{{{i}}}\) denote the input and output feature maps of the convolution layer, and \(h( \bullet ; \bullet )\) denotes the operation of the convolutional layer mapping the input feature to the output feature.

1. To reduce the parameters in DNNs, TR-VB divides the input and output channel dimensions into blocks, forming three latent tensors each in the tensor ring, along with two latent tensors from the filter dimensions, for a total of eight latent tensors. Therefore, the high-order tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{8} )\) consists of eight nodes, as shown in Eq. (11),

$$ {\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{8} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{8} = 1}}^{{r_{1} \cdots r_{8} }} {\prod\limits_{k = 1}^{8} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } $$
(11)

where \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{8} )\) denotes the \((i_{1} , \cdots ,i_{8} )\)-th element of the tensor, \(\left\{ {r_{1} , \cdots ,r_{9} } \right\}\) denote the TR ranks, each node \({\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} ) \in R^{{r_{k} \times n_{k} \times r_{k + 1} }}\) is a 3rd-order tensor with \(r_{9} = r_{1}\), and \(\alpha_{k}\) is the index of the latent dimensions.

2. Because the convolutional kernel is small, its decomposition contributes little to the parameter compression ratio, so the two 3rd-order tensors of the filter spatial dimension are contracted. The contracted high-order tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) is shown in Eq. (12),

$$ {\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{7} = 1}}^{{r_{1} \cdots r_{7} }} {\prod\limits_{k = 1}^{7} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } $$
(12)

where \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) denotes the \((i_{1} , \cdots ,i_{7} )\)-th element of the tensor, \(\left\{ {r_{1} , \cdots ,r_{8} } \right\}\) denote the TR ranks, each node \({\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} ) \in R^{{r_{k} \times n_{k} \times r_{k + 1} }}\) is a 3rd-order tensor with \(r_{8} = r_{1}\), and \(\alpha_{k}\) is the index of the latent dimensions.

3. The selected ranks are configured in the TR decomposition. The first four ranks are set to the input channel selection rank, \(r_{1} = r_{2} = r_{3} = r_{4} = {\text{R}}_{in}\), and the last three ranks to the output channel rank, \(r_{5} = r_{6} = r_{7} = {\text{R}}_{out}\). The high-order tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) after rank configuration is shown in Eq. (13) and Fig. 2,

$$ \left\{ \begin{gathered} {\mathcal{W}}_{i} (i_{1} , \cdots ,i_{7} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{7} = 1}}^{{r_{1} \cdots r_{7} }} {\prod\limits_{k = 1}^{7} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } \hfill \\ {\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} ) \in R^{{r_{k} \times n_{k} \times r_{k + 1} }} ,r_{8} = r_{1} \hfill \\ r_{1} = r_{2} = r_{3} = r_{4} = R_{in} \hfill \\ r_{5} = r_{6} = r_{7} = R_{out} \hfill \\ \end{gathered} \right. $$
(13)

where \({\text{R}}_{{{{in}}}}\) and \({\text{R}}_{{{{out}}}}\) denote the ranks selected along the input channel and output channel dimensions of the parameter tensor.

Fig. 2

The selected ranks are configured in the TR decomposition. The blue and purple areas in the figure represent the latent tensors obtained after partitioning along the input and output channel dimensions. The yellow area represents the latent tensor obtained after contraction. (Color figure online)
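To make the rank configuration concrete, the short calculation below lists the shapes of the seven TR cores implied by Eq. (13) and compares the resulting parameter count against the uncompressed convolution; the block sizes used to split the channel dimensions, the mode ordering, and the rank values are illustrative assumptions.

```python
# Core shapes and parameter count implied by Eq. (13) for one conv layer.
C_in, C_out, k = 64, 64, 3
in_blocks, out_blocks = (4, 4, 4), (4, 4, 4)     # C_in = 4*4*4, C_out = 4*4*4
R_in, R_out = 6, 6                               # ranks from Sect. 3.1 (example values)

dims = list(in_blocks) + list(out_blocks) + [k * k]          # 7 modes, filter modes contracted
ranks = [R_in, R_in, R_in, R_in, R_out, R_out, R_out, R_in]  # r_1..r_8, r_8 = r_1

core_shapes = [(ranks[i], dims[i], ranks[i + 1]) for i in range(7)]
tr_params = sum(a * n * b for a, n, b in core_shapes)
full_params = C_in * C_out * k * k
print(core_shapes)
print(f"TR params: {tr_params}, full conv params: {full_params}, "
      f"ratio: {full_params / tr_params:.1f}x")
```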

From the tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) obtained by TR decomposition, the compressed convolutional layer is constructed, as shown in Eq. (14),

$$ {\mathcal{Y}}_{{{i}}} = h({\mathcal{X}}_{{{i}}} ;{\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )) $$
(14)
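A hedged PyTorch sketch of Eq. (14) is given below: for clarity it contracts the TR cores back into a full kernel and then applies a standard convolution, although in practice the cores can be applied as a sequence of small contractions for efficiency. The core shapes follow the illustrative configuration above; the mode ordering and the (C_out, C_in, k, k) kernel layout expected by PyTorch are assumptions, not taken from the paper.

```python
# Apply the compressed layer of Eq. (14) by reconstructing the kernel from the
# seven TR cores of Eq. (13) and running a standard convolution.
import torch
import torch.nn.functional as F

def tr_cores_to_kernel(cores, C_in, C_out, k):
    full = cores[0]
    for core in cores[1:]:
        full = torch.tensordot(full, core, dims=([full.dim() - 1], [0]))
    # Close the ring: trace over the first and last rank modes.
    full = full.diagonal(dim1=0, dim2=full.dim() - 1).sum(-1)
    # Assumed mode order (in1, in2, in3, out1, out2, out3, k*k) -> conv kernel.
    return full.reshape(C_in, C_out, k, k).permute(1, 0, 2, 3).contiguous()

# Illustrative cores matching the shapes printed in the previous sketch.
cores = [torch.randn(6, 4, 6) for _ in range(6)] + [torch.randn(6, 9, 6)]
W = tr_cores_to_kernel(cores, C_in=64, C_out=64, k=3)
y = F.conv2d(torch.randn(1, 64, 32, 32), W, padding=1)   # Eq. (14)
print(y.shape)                                            # torch.Size([1, 64, 32, 32])
```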

3.3 Fine-Tune

Because compressing DNNs causes some performance loss, the accuracy of the compressed model must be recovered by fine-tuning, i.e., retraining the compressed DNN for several epochs. The fine-tuning objective function is shown in Eq. (15),

$$ {\mathcal{W}}^{*} { = }\mathop {\arg \min }\limits_{{{\tilde{\mathcal{W}}}}} \sum\limits_{k = 1}^{N} {\ell (f({\mathcal{X}}_{k} ;{\tilde{\mathcal{W}}}),{\mathcal{Y}}_{k} )} $$
(15)

where \(\left\{ {\left( {{\mathcal{X}}_{k} ,{\mathcal{Y}}_{k} } \right)} \right\}_{k = 1}^{N}\) denotes the training samples, \(N\) denotes the number of training samples, \(\ell ( \bullet , \bullet )\) denotes the training loss function, which measures the deviation between the DNN output and the label, and \({\mathcal{W}}^{*}\) denotes the parameters of the compressed DNN after fine-tuning.
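A minimal PyTorch sketch of the fine-tuning step in Eq. (15) is given below. The optimizer settings follow the configuration reported in Sect. 4 (SGD, momentum 0.8, weight decay 1e−4, initial learning rate 0.1, cross-entropy loss); the number of fine-tuning epochs and the `compressed_model`/`train_loader` objects are assumptions standing in for the outputs of the previous steps.

```python
# Fine-tune the compressed model by minimizing the loss in Eq. (15).
import torch
import torch.nn as nn

def fine_tune(compressed_model, train_loader, epochs=40, device="cuda"):
    compressed_model.to(device).train()
    criterion = nn.CrossEntropyLoss()                     # \ell(., .) in Eq. (15)
    optimizer = torch.optim.SGD(compressed_model.parameters(), lr=0.1,
                                momentum=0.8, weight_decay=1e-4)
    for _ in range(epochs):
        for x, y in train_loader:                         # (X_k, Y_k) pairs
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(compressed_model(x), y)
            loss.backward()
            optimizer.step()
    return compressed_model                               # parameters W* after fine-tuning
```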

The pseudocode of the TR-VB algorithm is shown below. In the first step, the TR-VB method unfolds the four-dimensional tensor and selects the rank by the GAS of EVBMF. In the second step, the neural network layer is compressed via TR decomposition using the rank obtained in the previous step. In the third step, the compressed model is fine-tuned to recover part of the performance loss caused by compression.

Algorithm 1 Pseudocode (TR-VB)

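The original pseudocode figure is not reproduced here. The following is an illustrative Python sketch of the three steps described above, assuming the helper functions sketched in Sect. 3 (`unfold_in_out`, `estimate_rank`, `slack_and_retrench`, `fine_tune`) are in scope; `conv_layers`, `tr_decompose_layer`, and `replace_with_tr` are hypothetical placeholders for iterating over, factorizing, and swapping in the compressed layers, not part of the paper's code.

```python
# Hedged sketch of the overall TR-VB pipeline (not the paper's released code).
def tr_vb_compress(model, train_loader, k_in, k_out, t_in, t_out):
    for layer in conv_layers(model):                       # hypothetical iterator
        # Step 1: rank selection via (a stand-in for) the GAS of EVBMF (Sect. 3.1).
        G_in, G_out = unfold_in_out(layer.weight)
        R_in = slack_and_retrench(estimate_rank(G_in), layer.in_channels, k_in, t_in)
        R_out = slack_and_retrench(estimate_rank(G_out), layer.out_channels, k_out, t_out)
        # Step 2: TR decomposition of the layer with the configured ranks (Sect. 3.2).
        replace_with_tr(layer, tr_decompose_layer(layer.weight, R_in, R_out))
    # Step 3: fine-tune the compressed model to recover accuracy (Sect. 3.3).
    return fine_tune(model, train_loader)
```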

4 Experiments and Discussions

Experiments in image classification task are designed to demonstrate generalization, advancement and effectiveness of the proposed TR-VB.

Firstly, to evaluate the generalization capability of TR-VB, it is evaluated on the Resnet20, Resnet32, and Resnet56 [33] models using the CIFAR-10 [34] and CIFAR-100 [34] datasets. Both CIFAR-10 and CIFAR-100 consist of 50,000 training images and 10,000 test images of size 32 × 32 × 3. CIFAR-10 has 10 object classes and CIFAR-100 has 100 categories.

Secondly, TR-VB’s results are compared with 15 state-of-the-art compression methods to validate the advancement of TR-VB. These methods fall into three categories: (1) TR network compression methods based on other rank selection schemes, such as TR-RL [30], based on reinforcement learning, and PSTRN-M/-S [29], based on progressively searching tensor ring networks; (2) low-rank decomposition methods, such as LCT [35], a compact design method for convolutional layers with spatial transformations; LCCUR [36], a method based on enumeration for solving the optimal rank and CUR decomposition; Hinge [37], which uses sparsity-inducing matrices to combine filter pruning and decomposition; LC [38], an alternating optimization framework that integrates learning and compression, enabling simultaneous network training and compression; AT [39], an approach that uses tensor decomposition to reduce the time of training a model from scratch; CNN-FCF [40], a CNN model with factorized convolutional filters updated by back-propagation; TRN [41], a method based on TR decomposition; Tucker [42]; and TT [43]; (3) other state-of-the-art methods, such as TRP [44], which alternates between low-rank approximation and training; KSE [45], a kernel sparsity and entropy indicator proposed to quantify feature map importance in a feature-agnostic manner to guide model compression; SSS [46], an effective sparse structure selection framework to learn and prune deep models in an end-to-end manner; and ThiNet [47], a filter-level pruning method for deep neural network compression.

Finally, comparative experiments with and without the GAS of EVBMF are conducted to verify the effectiveness of the GAS of EVBMF.

The experimental parameters are configured as shown in Table 1. All models are trained from random initialization. TR-VB is trained via SGD with momentum 0.8 and a weight decay of 1 × 10−4 on mini-batches of size 128. The random seed is set to 0. The loss function is cross-entropy. TR-VB is trained for 200 epochs with an initial learning rate of 0.1.

Table 1 Parameters configuration of Resnet20, Resnet32 and Resnet56 training

The hardware configuration of the experiments is an Intel(R) Core(TM) i7-10700 CPU and an NVIDIA GeForce 2080 Ti GPU. Software development tools mainly include PyTorch v1.1.0 and TensorLy v0.4.5.

4.1 Image Classification Experiments and Comparison

To validate the generalization and advancement of TR-VB, results of compressing classification networks such as Resnet20, Resnet32 and Resnet56 on the CIFAR-10 and CIFAR-100 datasets are reported; the network models are obtained at different compression ratios (each subfigure in Fig. 4 shows the results obtained at different compression ratios for the same network).

Figure 3 shows the comparison of TR-VB with two advanced TR network compression methods. Figure 4 shows comparisons with eight low-rank decomposition methods and three other state-of-the-art methods on CIFAR-10, and with eight low-rank decomposition methods and one other state-of-the-art method on CIFAR-100.

Fig. 3

Quantitative comparisons of TR-VB (orange circles) with two advanced TR methods, i.e., PSTRN (blue circles) and TR-RL (gray circles), on CIFAR-10 and CIFAR-100. The horizontal coordinate, vertical coordinate, and circle size represent compression ratio, Top-1 accuracy, and training time, respectively; smaller circles are better. (Color figure online)

Fig. 4

Quantitative comparisons of TR-VB with eleven other state-of-the-art methods, i.e., TRN, CNN-FCF, LCCUR, TRP, LC, LCT, TSVD, CUR, ThiNet, KSE and AT, on CIFAR-10, and with nine other state-of-the-art methods, i.e., TRN, LCCUR, Hinge, SSS, LCT, TSVD, CUR, Tucker and TT, on CIFAR-100. The horizontal coordinate, principal vertical coordinate, and auxiliary vertical coordinate represent the state-of-the-art methods, Top-1 accuracy (line chart), and compression ratio (bar chart), respectively. The TR-VB method at different compression ratio scales is shown on the far right of each panel

Compared to PSTRN and TR-RL, the proposed TR-VB method achieves the best performance in terms of accuracy, number of parameters, and training time at every compression ratio, as shown in Fig. 3. TR-VB is the fastest method, at least nine times faster than PSTRN and four times faster than TR-RL, which verifies that the training time cost of TR-VB is reduced significantly by selecting TR ranks via VBMF. Compared to PSTRN, TR-VB achieves improvements in both parameter count and Top-1 accuracy. In comparison to TR-RL, although TR-VB shows a slight decrease in Top-1 accuracy and compression ratio on CIFAR-100, it improves both Top-1 accuracy and compression ratio on CIFAR-10, thereby validating the rationality of setting unequal ranks in TR decomposition.

Compared to state-of-the-art methods at any compression ratio for all three networks, the proposed TR-VB method achieves a higher degree of parameter compression while maintaining classification accuracy, as shown in Fig. 4. On ResNet20-Cifar100, the baseline Top-1 accuracies of the Hinge, SSS and LCT methods are 68.83%, 69.09% and 66.46%, and their compressed Top-1 accuracies are 66.60%, 65.58% and 66.17%, respectively, whereas the baseline of TR-VB is 65.40%. The Top-1 accuracies of TR-VB-2 and TR-VB-3 are 64.02% and 65.11%, so TR-VB performs better than the Hinge, SSS and LCT methods because it incurs a smaller drop in Top-1 accuracy. On ResNet32-Cifar100, the baseline Top-1 accuracy of the LCT-1 and LCT-2 methods is 68.23%, and their compressed Top-1 accuracies are 67.76% and 68.51%, whereas the baseline of TR-VB is 68.10%. The Top-1 accuracies of TR-VB-3 and TR-VB-4 are 67.90% and 68.49%, so TR-VB performs better than the LCT method because it incurs a smaller drop in Top-1 accuracy. Firstly, the proposed TR-VB method significantly improves the compression ratio and Top-1 accuracy compared with the TSVD, AT, Hinge, TRP, LC, SSS and TT methods. Secondly, compared to the CNN-FCF, ThiNet, and KSE methods, TR-VB exhibits a slight decrease in Top-1 accuracy but achieves a significant improvement in compression ratio. In comparison to the Tucker method, although there is a slight decrease in compression ratio, TR-VB shows a significant improvement in Top-1 accuracy. Compared to the CUR method, while there is a slight decrease in compression ratio for the ResNet20-Cifar10 model, there is a substantial improvement in both compression ratio and Top-1 accuracy for the other network models. Finally, TR-VB exhibits a significant improvement in both Top-1 accuracy and compression ratio when compared to the TRN method on CIFAR-10; on CIFAR-100, TR-VB achieves a balance between compression ratio and Top-1 accuracy under various compression ratio scales. Compared to the LCCUR method, the results are similar: although TR-VB-3 performs worse than LCCUR-2 on ResNet20-Cifar100, TR-VB achieves a significant improvement in compression ratio over LCCUR-1 on ResNet20-Cifar10, and outperforms LCCUR on ResNet32-Cifar10, ResNet56-Cifar10, and ResNet56-Cifar100. On ResNet20-Cifar10, TR-VB yields inferior results compared to the LCT method, whereas on ResNet32-Cifar10 it performs better; on ResNet20-Cifar10, TR-VB achieves a balance between compression ratio and Top-1 accuracy.

4.2 Comparative Experiments and Comparison

To verify the effectiveness of TR-VB, experiments are conducted on three networks and two datasets using both the TR-VB method and TR decomposition without VBMF. The experimental results are shown in Tables 2 and 3 (bold font indicates the best performance).

Table 2 The compression results of the TR-VB and TR methods on CIFAR-10
Table 3 The compression results of the TR-VB and TR methods on CIFAR-100

On CIFAR-10 and CIFAR-100, the TR-VB method increases the compression ratio while maintaining accuracy, compared to TR decomposition. Meanwhile, the TR-VB method reduces training time by 4%–44.8% and 2.8%–41.3%, respectively, compared to the TR decomposition method. Because VBMF selects the ranks along the input and output channel dimensions and configures them on each edge of the TR, the TR ranks can be configured reasonably. In addition, as the compression ratio becomes smaller, the reduction in training time is more significant.

5 Conclusion

A TR-VB method is proposed to select TR ranks, and TR-VB is used to decompose all convolution layers in deep neural networks, resulting in a well-performing compressed neural network model. The TR-VB method has two advantages: firstly, it solves the problem of unreasonable rank configuration caused by setting all TR ranks equal; secondly, it reduces the training time cost through the VBMF rank selection process. Experimental results show that, compared with existing TR decomposition methods, TR-VB can significantly reduce the redundant parameters of deep neural networks, improve the compression ratio, and reduce training time while maintaining accuracy.