1 Introduction

In recent years, DNNs have achieved great success in various fields, such as information fusion [1,2,3], natural language processing [4,5,6], and computer vision [7]. However, DNNs require large amounts of memory and computational resources, which makes them rely heavily on high-performance computing devices. For example, the VGG-16 [8] model has 140 million parameters and occupies more than 500 MB of storage space. To construct compact DNN models with lower memory and computational cost, it is essential to reduce the number of model parameters while preserving accuracy.

DNN compression methods are divided into five categories according to their compression mechanisms [9]: pruning [10, 11], quantization [12, 13], knowledge distillation [14, 15], low-rank decomposition [16, 17], and network architecture search [18,19,20]. Among these five categories, low-rank decomposition achieves high compression ratios and efficient operations by approximating the original data with a lower-rank representation. Because tensor decomposition offers superior multidimensional representation and higher expressive power than matrix decomposition, this work focuses on tensor decomposition within the low-rank approximation category.

Tensor decomposition, as a mathematical tool for exploring low-rank structure in large-scale tensor data, is a very attractive DNN model compression technique. Many tensor decomposition methods have been studied, such as canonical polyadic (CP) decomposition [21, 22], Tucker decomposition [23, 24], tensor train (TT) decomposition [25, 26], and tensor ring (TR) decomposition [27, 28]. TR decomposition connects its endpoints in a circular structure without imposing a strict multiplication order on the nodes, which offers greater flexibility and enhances the representation capability for high-order tensors. Although TR decomposition has advantages in representation capability and flexibility, its compression performance is determined by the TR ranks.

Using equal TR ranks throughout a TR decomposition leads to unreasonable rank configurations. Existing approaches to rank selection, such as PSTRN [29] and TR-RL [30], rely on iterative processes that introduce a substantial computational burden and significantly extend training time. The PSTRN [29] method progressively searches for TR ranks through a genetic algorithm: it defines an interest region related to the rank and shrinks this region after each iteration to select the optimal rank. However, this iterative approach is time-consuming, with an overall training time of more than 30 h. The TR-RL [30] method selects ranks by reinforcement learning, using the DDPG algorithm to choose the optimal rank of each convolutional layer, but the interaction between environment and agent also consumes considerable time, with an overall training time of about 6 h.

A TR network compression method based on Variational Bayesian approximation (TR-VB) is proposed to address the issues of unreasonable rank configuration and long training time. Inspired by VBMF, TR-VB determines the TR ranks along the input and output channel dimensions through numerical computation, and these ranks then guide the TR decomposition. The proposed TR-VB makes the configuration of TR ranks more reasonable and reduces the time cost. The main contributions of this work can be summarized as follows.

  1.

    A new TR-VB network compression method is proposed to solve the problem of unreasonable rank configuration caused by setting all TR ranks equal. The TR-VB method significantly reduces the number of parameters while maintaining accuracy.

  2.

    The training time cost of TR-VB is reduced significantly by selecting TR ranks through VBMF, since TR-VB searches for TR ranks automatically without manual setting or an iterative approach.

The remainder of this manuscript is organized as follows. Section 2 introduces TR decomposition and VB approximation for DNN compression. Section 3 proposes the TR-VB method. Section 4 analyzes the experimental results. Section 5 concludes the manuscript.

2 Related Works

2.1 Tensor Ring Decomposition

TR decomposition represents a tensor as a series of third-order tensor nodes connected in a circular structure, which removes the strict sequential order required for the multi-linear products of cores in TT decomposition and improves the representation ability and flexibility of TT decomposition [31]. Specifically, let \({\mathcal{W}}\) be a \(d\)th-order tensor of size \(n_{1} \times n_{2} \times \cdots \times n_{d}\), denoted by \({\mathcal{W}} \in {\mathbf{R}}^{{n_{1} \times n_{2} \times \cdots \times n_{d} }}\). The TR representation decomposes it into a sequence of latent tensors \({\mathcal{Z}}_{k} \in R^{{r_{k} \times n_{k} \times r_{k + 1} }} ,k = 1, \cdots ,d\). In index form, the TR decomposition is given by,

$$ {\mathcal{W}}(i_{1} , \cdots ,i_{d} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{d} = 1}}^{{r_{1} \cdots r_{{\text{d}}} }} {\prod\limits_{k = 1}^{d} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } $$
(1)

where \(\left\{ {r_{1} , \cdots ,r_{d + 1} } \right\}\) denote the TR ranks, each node \({\mathcal{Z}}_{k} \in R^{{r_{k} \times n_{k} \times r_{k + 1} }}\) is a 3rd-order tensor, and \(r_{d + 1} = r_{1}\).
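For concreteness, the following NumPy sketch reconstructs a full tensor from its TR cores according to Eq. (1); the tensor sizes, ranks, and random cores are illustrative values, not taken from the paper.

```python
# A minimal NumPy sketch of Eq. (1): contracting TR cores Z_k of shape
# (r_k, n_k, r_{k+1}), with r_{d+1} = r_1, back into the full d-th order tensor.
import numpy as np

def tr_to_tensor(cores):
    """Contract a list of 3rd-order TR cores into the full tensor."""
    full = cores[0]                                   # shape (r_1, n_1, r_2)
    for core in cores[1:]:
        # full: (r_1, n_1, ..., n_k, r_{k+1}); core: (r_{k+1}, n_{k+1}, r_{k+2})
        full = np.tensordot(full, core, axes=([-1], [0]))
    # full: (r_1, n_1, ..., n_d, r_1); close the ring by tracing the rank modes.
    return np.trace(full, axis1=0, axis2=-1)

# Toy example: a 3rd-order tensor of size 4 x 5 x 6 with TR ranks (2, 3, 2, 2).
ranks, dims = [2, 3, 2, 2], [4, 5, 6]                 # ranks[d] = ranks[0]
cores = [np.random.randn(ranks[k], dims[k], ranks[k + 1]) for k in range(3)]
print(tr_to_tensor(cores).shape)                      # (4, 5, 6)
```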

2.2 Variational Bayesian Approximation

The Variational Bayesian (VB) approximation method has been successfully applied to matrix factorization (MF) and offers the advantage of automatically determining the dimensionality of the principal components. However, many existing methods implement the VB approximation through local search algorithms obtained from standard procedures. In contrast, Variational Bayesian Matrix Factorization (VBMF) determines the global solution of the cost function through numerical analysis [32]. Specifically, the global solution of VBMF is obtained from a reweighted singular value decomposition (SVD) of the observed matrix, where each weight is derived by solving a quartic equation whose coefficients are functions of the observed singular values. Unlike standard iterative algorithms, VBMF computes the globally optimal solution without iteration. Additionally, an empirical VB scenario is considered in which the hyperparameters (prior variances) are learned from the data. The global analytic solution (GAS) of empirical VBMF (EVBMF) eliminates the need for manual parameter tuning and facilitates computing the globally optimal solution [32].

3 The Proposed TR-VB

A TR-VB method based on the numerical calculation of the GAS of EVBMF is proposed. Unlike standard TR decomposition with equal TR ranks, TR-VB configures the TR ranks more reasonably, which improves the performance of the TR network. As shown in Fig. 1, TR-VB consists of three parts: rank selection, TR decomposition, and fine-tuning of the compressed model. Firstly, the rank is selected by the GAS of EVBMF after unfolding the 4-D tensor of convolution layer parameters into a 2-D matrix, as shown in the orange box in Fig. 1. Secondly, TR-VB uses TR decomposition to decompose the convolution layer parameters after block processing, as shown in the blue box in Fig. 1. Thirdly, fine-tuning is designed to reduce the performance loss of the compressed model caused by TR decomposition, as shown in the green box in Fig. 1.

Fig. 1

Overview of the tensor ring network compression method based on Variational Bayesian approximation (TR-VB): the orange box denotes rank selection, the blue box TR decomposition, and the green box fine-tuning. The TR-VB method uses the GAS of EVBMF to estimate the ranks of parameter tensors in convolutional layers; the selected ranks are then employed in TR decomposition to compress the redundant parameters within convolutional layers. Finally, the compressed model is fine-tuned on the training set to restore the performance of the network model after TR decomposition, yielding the compressed network model

3.1 Rank Selection

To address the time inefficiency caused by selecting ranks through iteration, TR-VB leverages the GAS of EVBMF to automatically determine the rank for matrix decomposition. The selection process consists of three steps: firstly, unfold the convolution layer parameters from a 4-D tensor into 2-D matrices; secondly, select the rank of the convolution layer parameters; finally, slacken and then retrench the selected rank. The following itemized list explains the specific design details of the TR-VB method.

1. In the tensor ring, the rank of each latent tensor is related to the latent tensor itself. To select the ranks of the latent tensors in the tensor ring, the parameters of the convolutional layers need to be unfolded from 4-D tensors into 2-D matrices.

First, for convenience, the 4-D tensor \({\mathcal{W}}\) of the convolution layer parameters is unfolded into a 3-D tensor \({\mathcal{G}}\) along the filter spatial dimension, as shown in Eq. (2),

$$ {\mathcal{G}} = \left[ {\begin{array}{*{20}c} {{\mathcal{W}}(1,1,:,:)} & {{\mathcal{W}}(1,2,:,:)} & \cdots & {{\mathcal{W}}(1,C_{out} ,:,:)} \\ {{\mathcal{W}}(2,1,:,:)} & {{\mathcal{W}}(2,2,:,:)} & \cdots & {{\mathcal{W}}(2,C_{out} ,:,:)} \\ \vdots & {} & \ddots & \vdots \\ {{\mathcal{W}}(C_{in} ,1,:,:)} & {{\mathcal{W}}(C_{in} ,2,:,:)} & \cdots & {{\mathcal{W}}(C_{in} ,C_{out} ,:,:)} \\ \end{array} } \right] \in {\mathbf{R}}^{{C_{in} \times C_{out} \times (kk)}} $$
(2)

where \({\mathcal{G}}\) denotes the 3-D parameter tensor obtained after unfolding, \({\mathcal{W}}\) denotes the 4-D parameter tensor of each layer before unfolding, and \(C_{in}\), \(C_{out}\) and \(k\) denote the input channel dimension, output channel dimension, and convolution kernel spatial dimension of a convolutional layer, respectively.

Then, because TR-VB performs the decomposition between the input and output channels, the 3-D tensor \({\mathcal{G}}\) is unfolded into the 2-D matrices \({\mathcal{G}}_{{{{in}}}}\) and \({\mathcal{G}}_{{{{out}}}}\), as shown in Eqs. (3) and (4),

$$ {\mathcal{G}}_{{{{in}}}} = \left[ {{\mathcal{G}}(:,:,1),{\mathcal{G}}(:,:,2), \cdots ,{\mathcal{G}}(:,:,n_{3} )} \right] \in R^{{C_{in} \times (C_{out} kk)}} $$
(3)
$$ {\mathcal{G}}_{{{{out}}}} = \left[ {{\mathcal{G}}(:,:,1)^{\text{T}} ,{\mathcal{G}}(:,:,2)^{\text{T}} , \cdots ,{\mathcal{G}}(:,:,n_{3} )^{\text{T}} } \right] \in R^{{C_{out} \times (C_{in} kk)}} $$
(4)

where \({\mathcal{G}}_{{{{in}}}}\) and \({\mathcal{G}}_{{{{out}}}}\) denote the 2-D matrices obtained by unfolding the convolution layer parameter tensor along the input channel dimension and the output channel dimension, respectively.

2. Because the GAS of EVBMF determines ranks by analytic numerical calculation and thus requires little computation time, TR-VB selects the TR ranks \(\left[ {\hat{R}_{{{{in}}}} ,\hat{R}_{{{{out}}}} } \right]\) through this analytical calculation to reduce the parameter redundancy in DNNs, as shown in Eq. (5),

$$ \left[ {\hat{R}_{{{{in}}}} ,\hat{R}_{{{{out}}}} } \right] = g_{{{{evbmf}}}} ({\mathcal{G}}_{{{{in}}}} ,{\mathcal{G}}_{{{{out}}}} ) $$
(5)

where \(\hat{R}_{{{{in}}}}\) and \(\hat{R}_{{{{out}}}}\) denote the ranks along the input channel dimension and output channel dimension, and \(g_{{{{evbmf}}}} ()\) denotes rank estimation of the parameter matrices using the GAS of EVBMF.

3. The selected ranks are slackened to \(\widetilde{R}_{{{{in}}}}\), \(\widetilde{R}_{{{{out}}}}\) to improve the information-carrying capacity of the decomposed tensor. Because some high-dimensional structural information is lost when a 4-D tensor is unfolded into 2-D matrices, the selected ranks must be slackened to enhance the information-bearing capacity of the decomposed tensor. To keep the slack factor between 0 and 1, TR-VB adopts the idea of normalization and obtains the slack ranks via Eqs. (6) and (7),

$$ \widetilde{R}_{{{{in}}}} = \hat{R}_{{{{in}}}} + k_{{{{in}}}} (C_{{{{in}}}} - \hat{R}_{{{{in}}}} ) $$
(6)
$$ \widetilde{R}_{{{{out}}}} = \hat{R}_{{{{out}}}} + k_{{{{out}}}} (C_{{{{out}}}} - \hat{R}_{{{{out}}}} ) $$
(7)

where \(k_{{{{in}}}}\) and \(k_{{{{out}}}}\) denote the slackness coefficients of the tensor rank along the input channel and output channel dimensions, with \(0 < k_{{{{in}}}} < 1\) and \(0 < k_{{{{out}}}} < 1\), and \(\widetilde{R}_{{{{in}}}}\) and \(\widetilde{R}_{{{{out}}}}\) denote the ranks along the input channel and output channel dimensions after slackness.

The excessively slackened ranks are then tightened to \({\text{R}}_{{{{in}}}}\), \({\text{R}}_{{{{out}}}}\) to achieve a reasonable TR rank configuration. A retrenching coefficient is introduced after the rank slackness, as shown in Eqs. (8) and (9),

$$ {\text{R}}_{{{{in}}}} = t_{{{{in}}}} \widetilde{R}_{{{{in}}}} $$
(8)
$$ {\text{R}}_{{{{out}}}} = t_{out} \widetilde{R}_{{{{out}}}} $$
(9)

where \(t_{{{{in}}}}\) and \(t_{out}\) denote the retrenching coefficients of the tensor rank along the input channel and output channel dimensions, with \(0 < t_{{{{in}}}} < 1\) and \(0 < t_{out} < 1\), and \({\text{R}}_{{{{in}}}}\) and \({\text{R}}_{{{{out}}}}\) denote the ranks along the input channel and output channel dimensions after retrenching.
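The sketch below walks through the three rank-selection steps above (Eqs. (2)–(9)) for a single convolutional layer. The GAS of EVBMF itself [32] is not reimplemented here; a simple singular-value energy threshold is substituted purely to show where the rank estimates enter the pipeline, and the slackness/retrenching coefficients and layer sizes are illustrative assumptions.

```python
# Illustrative sketch of Sect. 3.1 for one conv layer. The weight layout
# (C_in, C_out, k, k) follows the paper's indexing in Eq. (2); PyTorch stores
# conv weights as (C_out, C_in, k, k), so a transpose would be needed there.
import numpy as np

def unfold_in_out(W):
    """Eqs. (2)-(4): unfold a (C_in, C_out, k, k) weight into G_in and G_out."""
    C_in, C_out, k, _ = W.shape
    G = W.reshape(C_in, C_out, k * k)                          # Eq. (2)
    G_in = G.reshape(C_in, C_out * k * k)                      # Eq. (3)
    G_out = G.transpose(1, 0, 2).reshape(C_out, C_in * k * k)  # Eq. (4)
    return G_in, G_out

def estimate_rank(G, energy=0.95):
    """Stand-in for g_evbmf() in Eq. (5): smallest rank keeping `energy`
    of the squared singular values. NOT the paper's GAS of EVBMF."""
    s = np.linalg.svd(G, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

def slack_and_retrench(R_hat, C, k_coef, t_coef):
    """Eqs. (6)-(9): slacken toward the channel size, then retrench.
    Rounded to an integer since TR ranks must be integral."""
    R_tilde = R_hat + k_coef * (C - R_hat)
    return int(round(t_coef * R_tilde))

# Toy layer: C_in = 64, C_out = 128, 3x3 kernels; coefficients are examples.
W = np.random.randn(64, 128, 3, 3)
G_in, G_out = unfold_in_out(W)
R_in_hat, R_out_hat = estimate_rank(G_in), estimate_rank(G_out)    # Eq. (5)
R_in = slack_and_retrench(R_in_hat, C=64, k_coef=0.3, t_coef=0.7)  # Eqs. (6), (8)
R_out = slack_and_retrench(R_out_hat, C=128, k_coef=0.3, t_coef=0.7)  # Eqs. (7), (9)
print(R_in_hat, R_out_hat, R_in, R_out)
```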

3.2 Tensor Ring Decomposition

This section introduces the TR decomposition process of the parameters. Let \({\mathcal{W}}_{{{i}}}\) denote the 4-D parameter tensor of the \(i\)-th layer before unfolding; the convolution layer operation is shown in Eq. (10),

$$ {\mathcal{Y}}_{{{i}}} = h({\mathcal{X}}_{{{i}}} ;{\mathcal{W}}_{{{i}}} ) $$
(10)

where \({\mathcal{X}}_{{{i}}}\) and \({\mathcal{Y}}_{{{i}}}\) denote the input and output feature maps of the convolution layer, and \(h( \bullet ; \bullet )\) denotes the operation of the convolutional layer mapping the input feature to the output feature.

1. To reduce the parameters in DNNs, TR-VB divides the input and output channel dimensions into blocks, forming three latent tensors each in the tensor ring, along with two latent tensors from the filter dimensions, for a total of eight latent tensors. Therefore, the high-order tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{8} )\) consists of eight nodes, as shown in Eq. (11),

$$ {\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{8} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{8} = 1}}^{{r_{1} \cdots r_{8} }} {\prod\limits_{k = 1}^{8} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } $$
(11)

where \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{8} )\) denotes the \((i_{1} , \cdots ,i_{8} )\)-th element of the tensor, \(\left\{ {r_{1} , \cdots ,r_{9} } \right\}\) denote the TR ranks, each node \({\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} ) \in R^{{r_{k} \times n_{k} \times r_{k + 1} }}\) is a 3rd-order tensor with \(r_{9} = r_{1}\), and \(\alpha_{k}\) is the index of the latent dimensions.

2. Because the convolutional kernel is small, its decomposition contributes little to the parameter compression ratio, so the two 3rd-order tensors of the filter spatial dimension are contracted. The contracted high-order tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) is shown in Eq. (12),

$$ {\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{7} = 1}}^{{r_{1} \cdots r_{7} }} {\prod\limits_{k = 1}^{7} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } $$
(12)

where \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) denotes the \((i_{1} , \cdots ,i_{7} )\)-th element of the tensor, \(\left\{ {r_{1} , \cdots ,r_{8} } \right\}\) denote the TR ranks, each node \({\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} ) \in R^{{r_{k} \times n_{k} \times r_{k + 1} }}\) is a 3rd-order tensor with \(r_{8} = r_{1}\), and \(\alpha_{k}\) is the index of the latent dimensions.

3. The selected ranks are configured in the TR decomposition. The first four ranks are set to the input channel selection rank, \(r_{1} = r_{2} = r_{3} = r_{4} = {\text{R}}_{in}\), and the last three ranks to the output channel rank, \(r_{5} = r_{6} = r_{7} = {\text{R}}_{out}\). The high-order tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) after rank configuration is shown in Eq. (13) and Fig. 2,

$$ \left\{ \begin{gathered} {\mathcal{W}}_{i} (i_{1} , \cdots ,i_{7} ) = \sum\limits_{{\alpha_{1} , \cdots ,\alpha_{7} = 1}}^{{r_{1} \cdots r_{7} }} {\prod\limits_{k = 1}^{7} {{\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} )} } \hfill \\ {\mathcal{Z}}_{k} (\alpha_{k} ,i_{k} ,\alpha_{k + 1} ) \in R^{{r_{k} \times n_{k} \times r_{k + 1} }} ,r_{8} = r_{1} \hfill \\ r_{1} = r_{2} = r_{3} = r_{4} = R_{in} \hfill \\ r_{5} = r_{6} = r_{7} = R_{out} \hfill \\ \end{gathered} \right. $$
(13)

where \({\text{R}}_{{{{in}}}}\) and \({\text{R}}_{{{{out}}}}\) denote the ranks selected along the input channel and output channel dimensions of the parameter tensor.

Fig. 2

The selected ranks are configured in the TR decomposition. The blue and purple areas in the figure represent the latent tensors obtained after partitioning along the input and output channel dimensions. The yellow area represents the latent tensor obtained after contraction. (Color figure online)
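To make the rank configuration concrete, the short calculation below lists the shapes of the seven TR cores implied by Eq. (13) and compares the resulting parameter count against the uncompressed convolution; the block sizes used to split the channel dimensions, the mode ordering, and the rank values are illustrative assumptions.

```python
# Core shapes and parameter count implied by Eq. (13) for one conv layer.
C_in, C_out, k = 64, 64, 3
in_blocks, out_blocks = (4, 4, 4), (4, 4, 4)     # C_in = 4*4*4, C_out = 4*4*4
R_in, R_out = 6, 6                               # ranks from Sect. 3.1 (example values)

dims = list(in_blocks) + list(out_blocks) + [k * k]          # 7 modes, filter modes contracted
ranks = [R_in, R_in, R_in, R_in, R_out, R_out, R_out, R_in]  # r_1..r_8, r_8 = r_1

core_shapes = [(ranks[i], dims[i], ranks[i + 1]) for i in range(7)]
tr_params = sum(a * n * b for a, n, b in core_shapes)
full_params = C_in * C_out * k * k
print(core_shapes)
print(f"TR params: {tr_params}, full conv params: {full_params}, "
      f"ratio: {full_params / tr_params:.1f}x")
```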

From the tensor \({\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )\) obtained by TR decomposition, the compressed convolutional layer is constructed, as shown in Eq. (14),

$$ {\mathcal{Y}}_{{{i}}} = h({\mathcal{X}}_{{{i}}} ;{\mathcal{W}}_{{{i}}} (i_{1} , \cdots ,i_{7} )) $$
(14)
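A hedged PyTorch sketch of Eq. (14) is given below: for clarity it contracts the TR cores back into a full kernel and then applies a standard convolution, although in practice the cores can be applied as a sequence of small contractions for efficiency. The core shapes follow the illustrative configuration above; the mode ordering and the (C_out, C_in, k, k) kernel layout expected by PyTorch are assumptions, not taken from the paper.

```python
# Apply the compressed layer of Eq. (14) by reconstructing the kernel from the
# seven TR cores of Eq. (13) and running a standard convolution.
import torch
import torch.nn.functional as F

def tr_cores_to_kernel(cores, C_in, C_out, k):
    full = cores[0]
    for core in cores[1:]:
        full = torch.tensordot(full, core, dims=([full.dim() - 1], [0]))
    # Close the ring: trace over the first and last rank modes.
    full = full.diagonal(dim1=0, dim2=full.dim() - 1).sum(-1)
    # Assumed mode order (in1, in2, in3, out1, out2, out3, k*k) -> conv kernel.
    return full.reshape(C_in, C_out, k, k).permute(1, 0, 2, 3).contiguous()

# Illustrative cores matching the shapes printed in the previous sketch.
cores = [torch.randn(6, 4, 6) for _ in range(6)] + [torch.randn(6, 9, 6)]
W = tr_cores_to_kernel(cores, C_in=64, C_out=64, k=3)
y = F.conv2d(torch.randn(1, 64, 32, 32), W, padding=1)   # Eq. (14)
print(y.shape)                                            # torch.Size([1, 64, 32, 32])
```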

3.3 Fine-Tune

Because compressing DNNs causes some performance loss, the accuracy of the compressed model must be recovered by fine-tuning, i.e., retraining the compressed DNN for several epochs. The fine-tuning objective function is shown in Eq. (15),

$$ {\mathcal{W}}^{*} { = }\mathop {\arg \min }\limits_{{{\tilde{\mathcal{W}}}}} \sum\limits_{k = 1}^{N} {\ell (f({\mathcal{X}}_{k} ;{\tilde{\mathcal{W}}}),{\mathcal{Y}}_{k} )} $$
(15)

where \(\left\{ {\left( {{\mathcal{X}}_{k} ,{\mathcal{Y}}_{k} } \right)} \right\}_{k = 1}^{N}\) denotes the training samples, \(N\) denotes the number of training samples, \(\ell ( \bullet , \bullet )\) denotes the training loss function, which measures the deviation between the DNN output and the label, and \({\mathcal{W}}^{*}\) denotes the parameters of the compressed DNN after fine-tuning.
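A minimal PyTorch sketch of the fine-tuning step in Eq. (15) is given below. The optimizer settings follow the configuration reported in Sect. 4 (SGD, momentum 0.8, weight decay 1e−4, initial learning rate 0.1, cross-entropy loss); the number of fine-tuning epochs and the `compressed_model`/`train_loader` objects are assumptions standing in for the outputs of the previous steps.

```python
# Fine-tune the compressed model by minimizing the loss in Eq. (15).
import torch
import torch.nn as nn

def fine_tune(compressed_model, train_loader, epochs=40, device="cuda"):
    compressed_model.to(device).train()
    criterion = nn.CrossEntropyLoss()                     # \ell(., .) in Eq. (15)
    optimizer = torch.optim.SGD(compressed_model.parameters(), lr=0.1,
                                momentum=0.8, weight_decay=1e-4)
    for _ in range(epochs):
        for x, y in train_loader:                         # (X_k, Y_k) pairs
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(compressed_model(x), y)
            loss.backward()
            optimizer.step()
    return compressed_model                               # parameters W* after fine-tuning
```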

The pseudocode of the TR-VB algorithm is shown below. In the first step, the TR-VB method unfolds the four-dimensional tensor and selects the rank by the GAS of EVBMF. In the second step, the neural network layer is compressed via TR decomposition using the rank obtained in the previous step. In the third step, the compressed model is fine-tuned to recover part of the performance loss caused by compression.

Algorithm 1 Pseudocode (TR-VB)

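The original pseudocode figure is not reproduced here. The following is an illustrative Python sketch of the three steps described above, assuming the helper functions sketched in Sect. 3 (`unfold_in_out`, `estimate_rank`, `slack_and_retrench`, `fine_tune`) are in scope; `conv_layers`, `tr_decompose_layer`, and `replace_with_tr` are hypothetical placeholders for iterating over, factorizing, and swapping in the compressed layers, not part of the paper's code.

```python
# Hedged sketch of the overall TR-VB pipeline (not the paper's released code).
def tr_vb_compress(model, train_loader, k_in, k_out, t_in, t_out):
    for layer in conv_layers(model):                       # hypothetical iterator
        # Step 1: rank selection via (a stand-in for) the GAS of EVBMF (Sect. 3.1).
        G_in, G_out = unfold_in_out(layer.weight)
        R_in = slack_and_retrench(estimate_rank(G_in), layer.in_channels, k_in, t_in)
        R_out = slack_and_retrench(estimate_rank(G_out), layer.out_channels, k_out, t_out)
        # Step 2: TR decomposition of the layer with the configured ranks (Sect. 3.2).
        replace_with_tr(layer, tr_decompose_layer(layer.weight, R_in, R_out))
    # Step 3: fine-tune the compressed model to recover accuracy (Sect. 3.3).
    return fine_tune(model, train_loader)
```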

4 Experiments and Discussions

Experiments in image classification task are designed to demonstrate generalization, advancement and effectiveness of the proposed TR-VB.

Firstly, to evaluate the generalization capability of TR-VB, it is evaluated on the Resnet20, Resnet32, and Resnet56 [33] models using the CIFAR-10 [34] and CIFAR-100 [34] datasets. Both CIFAR-10 and CIFAR-100 consist of 50,000 training images and 10,000 test images of size 32 × 32 × 3. CIFAR-10 has 10 object classes and CIFAR-100 has 100 categories.

Secondly, TR-VB’s results are compared with 15 state-of-the-art compression methods to validate the advancement of TR-VB. These methods fall into three categories: (1) TR network compression methods based on other rank selection schemes, such as TR-RL [30], based on reinforcement learning, and PSTRN-M/-S [29], based on progressively searching tensor ring networks; (2) low-rank decomposition methods, such as LCT [35], a compact design method for convolutional layers with spatial transformations; LCCUR [36], a method based on enumeration for solving the optimal rank and CUR decomposition; Hinge [37], which uses sparsity-inducing matrices to combine filter pruning and decomposition; LC [38], an alternating optimization framework that integrates learning and compression, enabling simultaneous network training and compression; AT [39], an approach that uses tensor decomposition to reduce the time of training a model from scratch; CNN-FCF [40], a CNN model with factorized convolutional filters updated by back-propagation; TRN [41], a method based on TR decomposition; Tucker [42]; and TT [43]; (3) other state-of-the-art methods, such as TRP [44], which alternates between low-rank approximation and training; KSE [45], a kernel sparsity and entropy indicator proposed to quantify feature map importance in a feature-agnostic manner to guide model compression; SSS [46], an effective sparse structure selection framework to learn and prune deep models in an end-to-end manner; and ThiNet [47], a filter-level pruning method for deep neural network compression.

Finally, comparative experiments with and without the GAS of EVBMF are conducted to verify the effectiveness of the GAS of EVBMF.

The experimental parameters are configured as shown in Table 1. All models are trained from random initialization. TR-VB is trained via SGD with momentum 0.8 and a weight decay of 1 × 10−4 on mini-batches of size 128. The random seed is set to 0. The loss function is cross-entropy. TR-VB is trained for 200 epochs with an initial learning rate of 0.1.

Table 1 Parameters configuration of Resnet20, Resnet32 and Resnet56 training

The hardware configuration of the experiments is an Intel(R) Core(TM) i7-10700 CPU and an NVIDIA GeForce 2080 Ti GPU. Software development tools mainly include PyTorch v1.1.0 and TensorLy v0.4.5.

4.1 Image Classification Experiments and Comparison

To validate the generalization and advancement of TR-VB, results of compressing classification networks such as Resnet20, Resnet32 and Resnet56 on the CIFAR-10 and CIFAR-100 datasets are reported; the network models are obtained at different compression ratios (each subfigure in Fig. 4 shows the results obtained at different compression ratios for the same network).

Figure 3 shows the comparison of TR-VB with two advanced TR network compression methods. Figure 4 shows comparisons with eight low-rank decomposition methods and three other state-of-the-art methods on CIFAR-10, and with eight low-rank decomposition methods and one other state-of-the-art method on CIFAR-100.

Fig. 3

Quantitative comparisons of TR-VB (orange circles) with two advanced TR methods, i.e., PSTRN (blue circles) and TR-RL (gray circles), on CIFAR-10 and CIFAR-100. The horizontal coordinate, vertical coordinate, and circle size represent compression ratio, Top-1 accuracy, and training time, respectively; smaller circles are better. (Color figure online)

Fig. 4

Quantitative comparisons of TR-VB with eleven other state-of-the-art methods, i.e., TRN, CNN-FCF, LCCUR, TRP, LC, LCT, TSVD, CUR, ThiNet, KSE and AT, on CIFAR-10, and with nine other state-of-the-art methods, i.e., TRN, LCCUR, Hinge, SSS, LCT, TSVD, CUR, Tucker and TT, on CIFAR-100. The horizontal coordinate, principal vertical coordinate, and auxiliary vertical coordinate represent the state-of-the-art methods, Top-1 accuracy (line chart), and compression ratio (bar chart), respectively. The TR-VB method at different compression ratio scales is shown on the far right of each panel

Compared to PSTRN and TR-RL, the proposed TR-VB method achieves the best performance in terms of accuracy, number of parameters, and training time at every compression ratio, as shown in Fig. 3. TR-VB is the fastest method, at least nine times faster than PSTRN and four times faster than TR-RL, which verifies that the training time cost of TR-VB is reduced significantly by selecting TR ranks via VBMF. Compared to PSTRN, TR-VB achieves improvements in both parameter count and Top-1 accuracy. In comparison to TR-RL, although TR-VB shows a slight decrease in Top-1 accuracy and compression ratio on CIFAR-100, it improves both Top-1 accuracy and compression ratio on CIFAR-10, thereby validating the rationality of setting unequal ranks in TR decomposition.

Compared to state-of-the-art methods at any compression ratio for all three networks, the proposed TR-VB method achieves a higher degree of parameter compression while maintaining classification accuracy, as shown in Fig. 4. On ResNet20-Cifar100, the baseline Top-1 accuracies of the Hinge, SSS and LCT methods are 68.83%, 69.09% and 66.46%, and their compressed Top-1 accuracies are 66.60%, 65.58% and 66.17%, respectively, whereas the baseline of TR-VB is 65.40%. The Top-1 accuracies of TR-VB-2 and TR-VB-3 are 64.02% and 65.11%, so TR-VB performs better than the Hinge, SSS and LCT methods because it incurs a smaller drop in Top-1 accuracy. On ResNet32-Cifar100, the baseline Top-1 accuracy of the LCT-1 and LCT-2 methods is 68.23%, and their compressed Top-1 accuracies are 67.76% and 68.51%, whereas the baseline of TR-VB is 68.10%. The Top-1 accuracies of TR-VB-3 and TR-VB-4 are 67.90% and 68.49%, so TR-VB performs better than the LCT method because it incurs a smaller drop in Top-1 accuracy. Firstly, the proposed TR-VB method significantly improves the compression ratio and Top-1 accuracy compared with the TSVD, AT, Hinge, TRP, LC, SSS and TT methods. Secondly, compared to the CNN-FCF, ThiNet, and KSE methods, TR-VB exhibits a slight decrease in Top-1 accuracy but achieves a significant improvement in compression ratio. In comparison to the Tucker method, although there is a slight decrease in compression ratio, TR-VB shows a significant improvement in Top-1 accuracy. Compared to the CUR method, while there is a slight decrease in compression ratio for the ResNet20-Cifar10 model, there is a substantial improvement in both compression ratio and Top-1 accuracy for the other network models. Finally, TR-VB exhibits a significant improvement in both Top-1 accuracy and compression ratio when compared to the TRN method on CIFAR-10; on CIFAR-100, TR-VB achieves a balance between compression ratio and Top-1 accuracy under various compression ratio scales. Compared to the LCCUR method, the results are similar: although TR-VB-3 performs worse than LCCUR-2 on ResNet20-Cifar100, TR-VB achieves a significant improvement in compression ratio over LCCUR-1 on ResNet20-Cifar10, and outperforms LCCUR on ResNet32-Cifar10, ResNet56-Cifar10, and ResNet56-Cifar100. On ResNet20-Cifar10, TR-VB yields inferior results compared to the LCT method, whereas on ResNet32-Cifar10 it performs better; on ResNet20-Cifar10, TR-VB achieves a balance between compression ratio and Top-1 accuracy.

4.2 Comparative Experiments and Comparison

To verify the effectiveness of TR-VB, experiments are conducted on three networks and two datasets using both the TR-VB method and TR decomposition without VBMF. The experimental results are shown in Tables 2 and 3 (bold font indicates the best performance).

Table 2 The compression results of the TR-VB and TR methods on CIFAR-10
Table 3 The compression results of the TR-VB and TR methods on CIFAR-100

On CIFAR-10 and CIFAR-100, the TR-VB method increases the compression ratio while maintaining accuracy, compared to TR decomposition. Meanwhile, the TR-VB method reduces training time by 4%–44.8% and 2.8%–41.3%, respectively, compared to the TR decomposition method. Because VBMF selects the ranks along the input and output channel dimensions and configures them on each edge of the TR, the TR ranks can be configured reasonably. In addition, as the compression ratio becomes smaller, the reduction in training time is more significant.

5 Conclusion

A TR-VB method is proposed to select TR ranks, and TR-VB is used to decompose all convolution layers in deep neural networks, resulting in a well-performing compressed neural network model. The TR-VB method has two advantages: firstly, it solves the problem of unreasonable rank configuration caused by setting all TR ranks equal; secondly, it reduces the training time cost through the VBMF rank selection process. Experimental results show that, compared with existing TR decomposition methods, TR-VB can significantly reduce the redundant parameters of deep neural networks, improve the compression ratio, and reduce training time while maintaining accuracy.