1 Introduction

In the past decade, deep learning has achieved remarkable breakthroughs in a large variety of applications, such as image classification [25, 40], object detection [37, 38] and semantic segmentation [20, 29]. Meanwhile, the architecture of deep convolutional neural networks (CNNs) has evolved for years. ResNet [12] is a milestone in the development of neural network architectures: it introduces shortcut connections to ease optimization when training very deep networks, but it utilizes a large number of parameters. To alleviate the huge computational burden of ResNet, the more computationally efficient DenseNet was proposed [17]. Different from ResNet, which combines features through summation, DenseNet combines features through concatenation and can therefore be trained more efficiently. However, some redundancy remains, especially in the bottleneck layers. Visualization of DenseNet’s connectivity pattern shows that later layers tend to prefer recently learned features together with some features from very early layers [17]. Deep Roots also argues that it is unlikely that every filter in a neural network relies on the output of all the filters in the previous layer [18]. These observations lead to sparsely connected architectures like LogDenseNet [14] and SparseNet [57], which use a sparse “log-offset” connectivity pattern to improve DenseNet’s efficiency in terms of parameters. However, this pre-defined connectivity pattern is not flexible. To overcome this shortcoming, Huang et al. propose CondenseNet [16], a novel network architecture that gradually removes less important connections in bottleneck layers. Based on the observation that different convolutional groups in CondenseNet prune filters independently, and inspired by the fact that exclusive lasso regularization brings sparsity at the intra-group level [22], we insert an exclusive lasso penalty into CondenseNet to learn more diversified features. The proposed model is denoted as CondenseNet-elasso. Our method applies to scenarios where the network backbone is composed of stacks of dense blocks. It is also promising to combine our proposed method with applications such as medical image analysis [21], graph convolution networks [7] and body pose prediction [46].


Our Contribution In this paper, we propose an exclusive lasso regularizer on the learned group convolutional layer in CondenseNet to decorrelate filters between different convolutional groups, thereby alleviating the neural network’s overfitting problem. This method reduces filter co-dependence between different groups. We validate CondenseNets-elasso of varying depths on three public datasets and on medical images (Appendix B). The experimental results demonstrate that our method achieves better performance than CondenseNet under the same computation budget and much better performance than other DenseNet variants.


Outline of the paper Section 2 gives a brief review of related work. Section 3 describes DenseNet and CondenseNet, on which our method is based. Section 4 introduces our proposed CondenseNet-elasso. Next, in Sect. 5, a large set of experiments is carried out to examine the performance of CondenseNet-elasso. To be specific, Sect. 5.3 shows classification results on CIFAR and Tiny ImageNet for models of different scales. Section 5.4 validates our assumption on why the exclusive lasso regularization works. Section 5.6 compares our model with other group convolution variants. Finally, we conclude the paper in Sect. 6.

2 Related work

In this section, we review related work on network pruning, group convolutions, neural network regularization and exclusive lasso.


Network pruning Deep networks often have a large number of redundant weights that can be pruned without sacrificing accuracy [6]. Han et al. propose an iterative process that removes unimportant connections based on the \(l_{1}\)-norm and \(l_{2}\)-norm, followed by fine-tuning to recover model accuracy [10]. Hu et al. claim that neurons with a high APoZ (average percentage of zeros) are redundant and can be pruned without affecting the overall performance of the network [15]. Structured Sparsity Learning (SSL) is proposed to regularize the structures (filters, filter shapes and layer depth) of DNNs [45]. ThiNet uses the reconstruction error of the next layer to measure the importance of filters in the current layer [31]. Li et al. propose a data-free filter selection criterion that uses the \(l_{1}\)-norm as the importance measure [26]; this method can prune multiple layers at once based on a layer-level sensitivity analysis. He et al. propose the soft filter pruning method, which allows pruned filters to recover to nonzero values through backpropagation [13]. Yu et al. measure neuron importance by minimizing the reconstruction error in the pre-softmax layer; the paper derives a closed-form solution that calculates the “importance score” in earlier layers by propagating back from the “final response layer” [51].


Neural network regularization Many regularization methods have been proposed to improve the generalization of neural networks. Dropout [41] and dropconnect [42] regularize deep networks by randomly setting a subset of activations or weights to zero during training. Cogswell et al. propose a regularizer that encourages non-redundant or diverse representations in DNNs by minimizing the cross-covariance of hidden activations [4]. Changpinyo et al. propose to train models in an incremental manner by starting with a network that contains only a small fraction of connections and adding connections over time [2].

Group lasso regularization has been introduced into deep neural networks to obtain highly compact networks [39]. However, there exists a strong correlation among the convolutional filters trained under group lasso constraints. Based on this observation, a decorrelation regularization method was proposed to weaken the correlation between different filters and achieve a more compact model [58]. Meanwhile, a sparsity-inducing regularizer called GrOWL (group ordered weighted \(l_{1}\)) was introduced, which not only eliminates unimportant neurons but also identifies heavily correlated neurons by setting the corresponding weights to a common value [52].


Group convolution Group convolution has been widely used in designing efficient networks. It was first introduced in AlexNet due to a lack of GPU memory: the inputs are partitioned into G mutually exclusive groups, each producing its own output [25]. As a result, the number of parameters and FLOPs is reduced to 1/G of that of a standard convolution. Meanwhile, ResNeXt investigates the trade-off between depth, width and the number of groups [47]; the paper suggests that a larger number of groups leads to better accuracy under similar computational costs. Ioannou et al. make a further study of group convolution and propose Deep Roots, which uses filter groups to force the network to learn filters with only limited dependence on previous layers [18].

Some researchers also propose learnable group convolutions to make group convolution more flexible. Peng et al. use a low-rank decomposition to approximate the weight matrix W as the product of two matrices D and P, in which D is a block diagonal matrix that can be turned into a group convolution and P is a \(1\times 1\) convolution [35]. Later on, fully learnable group convolution (FLGC) was proposed to learn a binary selection matrix for input channels and groups, covering both input channel-group connectivity and filter-group connectivity [44]. One disadvantage of this method is that the final group convolution does not have the same number of input channels in each group, so it is hard to implement efficiently in some deep learning frameworks (such as PyTorch). Besides, FLGC forces each input channel to be assigned to the group with the maximum probability, which may not be desirable since some important input channels could be shared among different convolutional groups. Meanwhile, Zhang et al. propose a dynamic grouping method, which learns the group number and channel connections simultaneously through a relationship matrix [55]. To reduce the number of learnable parameters in the relationship matrix, the authors decompose it into a series of Kronecker products of smaller matrices. Dynamic grouping convolution can learn the number of groups in each layer in an end-to-end manner; however, it requires the number of convolutional filters to be a power of two, and some group connectivity patterns cannot be represented by Kronecker products of small matrices. Recently, Guo et al. propose self-grouping convolutional neural networks (SG-CNN), which generate groups based on a clustering method, resulting in similar filters within groups and diverse filters between groups [9]. However, the resulting convolutional groups have an uneven number of filters in each group, which leads to an inefficient implementation.


Exclusive lasso The exclusive lasso penalty (\(l_{1,2}\)-norm) [56] was first introduced in a multi-task feature selection problem. This regularizer introduces competition among different tasks for the same feature. An empirical way to analyze the behavior of a penalty is to visualize its corresponding isosurface [39]. An illustrative example is shown in Fig. 1c. The figure shows two special cases of exclusive lasso: if each parameter is in its own group, the penalty is equivalent to the squared \(l_{2}\) norm, while if all parameters are in the same group, the penalty is equivalent to the square of the \(l_{1}\) norm.

Later on, Kong et al. propose the exclusive group lasso, which leads to sparsity at the intra-group level [22]. In the deep learning context, Yoon and Hwang first use the exclusive lasso penalty as a regularization method in neural networks [50]. The paper employs group lasso to promote inter-group sparsity and exclusive lasso to enforce intra-group sparsity; a weighted combination of these two regularizers achieves feature sharing and feature competition simultaneously. The resulting model is denoted as CGES (combined group and exclusive sparsity). Exclusive sparsity helps the network converge faster, learn less redundant features, and makes each group as different as possible. Different from CGES, our model achieves group-level sparsity while CGES achieves filter-level sparsity, because our model prunes filters based on the condensation criterion whereas CGES prunes filters based on the group lasso and exclusive lasso regularizers.

Fig. 1

Unit balls for different regularization terms. Let \(W = (w_{1}^{1}, w_{2}^{1}, w_{1}^{2})\), where the superscript denotes the group number and the subscript denotes the index within a group. In this case, \(w_{1}^{1},w_{2}^{1},w_{1}^{2}\) are variables along the x-, y- and z-axes, respectively. The penalty is defined as \(\varOmega =(\left| w_{1}^{1}\right| +\left| w_{2}^{1}\right| )^{2}+\left| w_{1}^{2}\right| ^{2}\). a Considering variables in different groups (\(w_{1}^{1}\) and \(w_{1}^{2}\)) by setting \(w_{2}^{1}=0\) yields the ball generated by the \(l_{2}\)-norm. The \(l_{2}\) unit ball is a sphere, which does not favor any of the variables. b Considering variables in the same group (\(w_{1}^{1}\) and \(w_{2}^{1}\)) by setting \(w_{1}^{2}=0\) yields a unit ball generated by the \(l_{1}\)-norm. The \(l_{1}\) unit ball is a regular octahedron surface, enforcing sparsity between different variables. c Unit ball for the \(l_{1,2}\) penalty. The \(l_{1,2}\) unit ball enforces sparsity within groups and diversity between groups

3 DenseNet and CondenseNet

In this section, we describe DenseNet [17] and CondenseNet [16] on which our method is based.

3.1 DenseNet

Fig. 2

a DenseNet basic building block. b DenseNet-bottleneck basic building block

The distinguishing property of DenseNet is that each layer receives as its input a concatenation of the feature maps generated by all preceding layers within the same dense block. In particular, the lth layer receives the feature maps of all preceding layers, namely,

$$\begin{aligned} x_{l}=H_{l}([x_{0},x_{1},...,x_{l-1}]) \end{aligned}$$
(1)

where \([x_{0},x_{1},...,x_{l-1}]\) refers to the concatenation of the feature maps produced in the previous l layers. There are two architectures in DenseNet [17]. One is DenseNet, whose basic building block is a \(3\times 3\) convolutional layer (Fig. 2a). The other is DenseNet-bottleneck (DenseNet-BC), whose basic building block is composed of one \(1\times 1\) bottleneck layer followed by one \(3\times 3\) convolutional layer (Fig. 2b). In this paper, we mainly focus on pruning bottleneck layers; therefore, we use DenseNet as an abbreviation for DenseNet-BC in later discussions.
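To make the structure concrete, the following is a minimal PyTorch sketch of a DenseNet-BC building block (Fig. 2b). It is an illustration only, assuming the usual BN-ReLU-Conv ordering and a bottleneck width of 4k; the class and argument names are ours, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DenseLayerBC(nn.Module):
    """One DenseNet-BC block: 1x1 bottleneck followed by a 3x3 convolution."""
    def __init__(self, in_channels: int, growth_rate: int, bottleneck_width: int = 4):
        super().__init__()
        inter = bottleneck_width * growth_rate
        self.bottleneck = nn.Sequential(          # 1x1 bottleneck layer
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False))
        self.conv3x3 = nn.Sequential(             # 3x3 layer producing k new feature maps
            nn.BatchNorm2d(inter), nn.ReLU(inplace=True),
            nn.Conv2d(inter, growth_rate, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        # x is the concatenation [x_0, x_1, ..., x_{l-1}] of Eq. (1)
        new_features = self.conv3x3(self.bottleneck(x))
        return torch.cat([x, new_features], dim=1)  # append x_l to the running concatenation
```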

3.2 CondenseNet

To learn a good connectivity pattern automatically, CondenseNet was proposed on the basis of DenseNet [16]. Specifically, the authors design a new basic building block, learned group convolution, by splitting the filters of the bottleneck layer into multiple groups and gradually removing less important features during training. CondenseNet improves DenseNet’s efficiency in terms of the number of parameters and floating-point operations (FLOPs). The most prominent characteristic of CondenseNet is that the final pre-trained model can be converted into standard group convolutions, which brings actual acceleration at deployment.

Here, we first introduce some notation to facilitate the discussion in this section. A standard convolutional layer generates O output feature maps by applying O convolutional filters over R input feature maps. For bottleneck layers with kernel size one, the 4D weight tensor simplifies to a 2D matrix. For each convolutional layer, the kernel is divided into G groups, denoted as \(\varvec{W}^{1}, \varvec{W}^{2},...,\varvec{W}^{G}\), where G is a pre-defined number and \(\varvec{W}^{g}\) is of size \(\frac{O}{G}\times {R}\). We use the symbol \(W_{i,j}^{g,l}\) to represent the weight connecting the jth input to the ith output within group g in layer l. In what follows, we introduce some key components of CondenseNet.


Network DenseNet’s basic building block is composed of one 1 × 1 convolutional layer followed by one 3 × 3 convolutional layer. CondenseNet replaces the first layer with a learned group convolution and the second layer with a group convolution. DenseNet adds k new feature maps at each layer, where k is referred to as the growth rate. Meanwhile, CondenseNet uses an “exponentially increasing growth rate” schedule: the growth rate doubles whenever the feature maps are downsampled. To encourage feature reuse, CondenseNet removes the 1 × 1 convolutional layers used for channel reduction in the transition blocks between stages of DenseNet; if these layers are kept, the resulting model is called CondenseNet-light.


Condensation criterion The condensation criterion measures the importance of each input channel to each convolutional group. It gives each convolutional group the flexibility to select the most relevant input features and guarantees that the filters in each convolutional group select the same subset of input channels. To be specific, the importance of the jth input channel for filter group g is evaluated by the averaged absolute value of the weights between them across all outputs within the group. This importance score is denoted as \(S_{j}^{g,l}\), namely, \(S_{j}^{g,l} = \sum _{i=1}^{O/G} \left| W_{i,j}^{g,l}\right| .\) The pruning procedure used in CondenseNet can be summarized as follows: for a given group g, calculate the importance score \(S_{j}^{g,l}\) for each input channel j; sort the \(S_{j}^{g,l}\) in ascending order and select the 1/C fraction with the smallest scores, denoting the corresponding filters as \(\varvec{W}_{j,lower}^{g,l}\); zero out the filters in \(\varvec{W}_{j,lower}^{g,l}\). This condensation criterion forces each layer to have limited dependency on previous layers, and the resulting sparse connectivity pattern reduces computational complexity and model size.
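As a hedged illustration of this criterion (not the released CondenseNet code), the sketch below computes \(S_{j}^{g,l}\) for one 1 × 1 learned group convolution and zeroes the connections to the least important input channels of each group; the function name and the `fraction` argument are ours.

```python
import torch

def condense_step(weight: torch.Tensor, groups: int, fraction: float) -> torch.Tensor:
    """Return a binary mask that zeroes `fraction` of the input channels per group.

    `weight` is the (O, R) matrix of a 1x1 bottleneck layer (kernel dims squeezed out).
    """
    out_channels, in_channels = weight.shape
    per_group = out_channels // groups
    n_prune = int(fraction * in_channels)
    mask = torch.ones_like(weight)
    for g in range(groups):
        block = weight[g * per_group:(g + 1) * per_group]   # W^g, shape (O/G, R)
        scores = block.abs().sum(dim=0)                      # S_j^{g} for each input channel j
        prune_idx = scores.argsort()[:n_prune]               # smallest scores first
        mask[g * per_group:(g + 1) * per_group, prune_idx] = 0.0
    return mask  # applied to the weight via element-wise multiplication
```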

Fig. 3

CondenseNet training process illustration with group number G=2 and input channel number R=8. The condensation factor C is set to 4. Group 1 prunes out \(\{2,3\}\rightarrow \{1,7\}\rightarrow \{4,6\}\) while Group 2 prunes out \(\{4,6\}\rightarrow \{1,8\}\rightarrow \{5,7\}\) in three condensing stages. In the testing stage, \(\{5,8\}\) and \(\{2,3\}\) are kept for Group 1 and Group 2 in the final converted model. Pruned filters are marked by dashed light lines (best viewed in color)


Training Let M denote the total number of training epochs. The first M/2 epochs, which comprise the condensing stage, are used for pruning, while the second M/2 epochs, called the optimization stage, are used for fine-tuning. Each condensing stage screens out 1/C of the filters in each convolutional group; there are \(C-1\) condensing stages in total, so only a 1/C fraction of the filters remains in the final model. One thing to note is that the pruned weights are not removed during training: the filter tensor \({\mathcal {F}}\) is masked by a binary tensor \({\mathcal {M}}\) of the same size using the element-wise product. When training is finished, the learned group convolutions are rearranged into standard group convolutions through \({\mathcal {M}}\); the final model is called the converted model. The FLOPs and parameters of the learned group convolutions reduce to 1/C of those of standard convolutions in the final converted model. To facilitate the understanding of the training process of CondenseNet, we illustrate it in Fig. 3. Suppose that the total number of training epochs is 300 and the condensation factor C is 4; then each learned group convolution is pruned before epochs 50, 100 and 150, with 75%, 50% and 25% of the filters remaining, respectively. Next, the model is fine-tuned for another 150 epochs. The detailed training process is illustrated in Algorithm 1.
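The schedule above can be summarized by a small helper. This is an illustrative reconstruction of the description rather than code from [16]; with M = 300 and C = 4 it reproduces the pruning epochs 50, 100 and 150 with 75%, 50% and 25% of the filters remaining.

```python
def condensing_schedule(total_epochs: int, condensation_factor: int):
    """Epochs at which each condensing stage ends and the fraction of filters kept afterwards."""
    stage_length = total_epochs // (2 * (condensation_factor - 1))
    epochs, kept = [], []
    for stage in range(1, condensation_factor):
        epochs.append(stage * stage_length)
        kept.append(1.0 - stage / condensation_factor)
    return epochs, kept

print(condensing_schedule(300, 4))  # ([50, 100, 150], [0.75, 0.5, 0.25])
```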

4 CondenseNet with exclusive lasso regularizer

At present, various regularization strategies have been developed, and they are indeed beneficial for enhancing the generalization capacity of DNNs. A simple and effective approach is to add a proper regularization term to the loss function. To the best of our knowledge, there is no literature investigating how the exclusive lasso regularizer can be applied to CondenseNet. To address this issue, we improve CondenseNet’s performance by introducing an exclusive lasso penalty on the learned group convolutions. In later discussions, the proposed model is abbreviated as CondenseNet-elasso for ease of exposition. First, we introduce some notation for the rest of the paper. Let \({\mathcal {D}}={\{\varvec{x}_{i},y_{i}\}}_{i=1}^{N}\) denote the training dataset with N instances, where \(\varvec{x}_{i}\) is the ith input and \({y}_{i}\) is its class label among k classes. Let \(\varvec{W}^{(l)}\) denote the weights of the lth layer and L the total number of layers, so that \(\{\varvec{W}^{(l)}\}\) includes the weights across all L layers. \({\mathcal {L}}(\varvec{W})\) is the cross-entropy loss of the classification task parameterized by \(\varvec{W}\). Let R and O denote the number of input and output channels, respectively, and G the number of convolutional groups. \(W_{i,j}^{g,l}\) refers to the weight connecting the jth input to the ith output of group g in layer l.


Regularizer Exclusive lasso was first introduced in a multi-task learning framework to force the model parameters of different tasks to compete for features. The training procedure of CondenseNet guarantees that different groups are pruned independently. Based on this observation, we expect different convolutional groups to compete for input channels and therefore achieve diversified feature representations. As a result, we regard the filters connected to each input channel as a group. Specifically, we define the regularizer in layer l as:

$$\begin{aligned} \varOmega (\varvec{W}^{(l)})&= \sum _{j=1}^{R} \left( \sum _{g=1}^{G} S_{j}^{g,l}\right) ^{2} \end{aligned}$$
(2)
$$\begin{aligned}&= \sum _{j=1}^{R} \left( \sum _{g=1}^{G} \sum _{i=1}^{O/G} \left| W_{i,j}^{g,l}\right| \right) ^{2} \end{aligned}$$
(3)
$$\begin{aligned}&= \sum _{j=1}^{R} \left( \sum _{i=1}^{O} \left| W_{i,j}^{l}\right| \right) ^{2}. \end{aligned}$$
(4)

Experimental results in Sect. 5.4.1 show that our proposed regularization term indeed leads to less overlap of incoming channels among different convolutional groups. Moreover, Sect. 5.4.2 shows that the proposed regularizer does help different groups to learn more diversified features. In preliminary experiments, we also tried different grouping strategies; for example, regarding the filters of convolutional group g connected with input channel j as a group gives \(\varOmega _{1}(\varvec{W}^{(l)}) = \sum _{j=1}^{R} \sum _{g=1}^{G} \left( \sum _{i=1}^{O/G} \left| W_{i,j}^{g,l}\right| \right) ^{2}\). However, this regularizer does not lead to a consistent performance improvement over the baseline models.
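For concreteness, a minimal PyTorch sketch of Eqs. (2)-(4) for one bottleneck layer is given below. It treats the weights attached to each input channel as one exclusive-lasso group; the function name is ours and not part of any released code.

```python
import torch

def exclusive_lasso(weight: torch.Tensor) -> torch.Tensor:
    """Sum over input channels of the squared l1-norm of the attached weights (1x1 kernel assumed)."""
    w = weight.flatten(start_dim=1) if weight.dim() > 2 else weight  # (O, R) view of the kernel
    column_l1 = w.abs().sum(dim=0)      # inner sum over all O outputs for each input channel j
    return (column_l1 ** 2).sum()       # outer sum over the R input channels
```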

Note that Eq. (2) can be reformulated as Eq. (4), whose form is equivalent to the exclusive sparsity regularization in [50]. However, our method differs from [50] in two aspects. First, CondenseNet-elasso prunes filters through the condensation criterion, not through this sparsity-inducing regularization term; the exclusive lasso penalty here is used to encourage less redundancy between groups rather than to strictly force each filter into exactly one convolutional group. Second, the condensation criterion guarantees that filters in the same group take the same subset of incoming channels. In other words, a given input channel is either selected by all filters in a convolutional group or by none of them. This characteristic results in group-level filter pruning, whereas [50] results in kernel-level filter pruning.

Loss We use the following loss function in training:

$$\begin{aligned} {\mathcal {L}}(\{\varvec{W}^{(l)}\},{\mathcal {D}}) + \lambda \sum _{l=1}^{L} \varOmega (\varvec{W}^{l}). \end{aligned}$$
(5)

Here, \(\lambda\) is the hyperparameter weighting the regularization term \(\varOmega (\varvec{W}^{(l)})\); details on the hyperparameter setting are described in Sect. 5.5. Note that we do not include \(\varOmega (\varvec{W}^{(l)})\) during the optimization stage, since it is only used for selecting promising filters. The detailed training process is shown in Algorithm 1.
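Putting Eq. (5) together with the stage gating described above, a hedged sketch of the objective is shown below. It reuses the exclusive_lasso helper sketched earlier, and a single \(\lambda\) is shown for simplicity even though \(\lambda\) actually varies across the three stages (Sect. 5.5).

```python
def training_loss(ce_loss, lgc_weights, lam, in_condensing_stage: bool):
    """Cross-entropy plus the exclusive lasso term, applied only during the condensing stage."""
    if not in_condensing_stage:
        return ce_loss                                       # optimization stage: no penalty
    reg = sum(exclusive_lasso(w) for w in lgc_weights)       # sum over learned group convolutions
    return ce_loss + lam * reg
```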

Algorithm 1

Optimization Campbell et al. give a coordinate descent algorithm for optimization in the context of predictive regression modeling with exclusive lasso regularization [1]. Meanwhile, CGES uses a proximal gradient descent method for optimization: the learned group convolution kernels are updated in an iterative manner by first updating the selected variables with the loss-based gradients and then applying the proximal operator to them. Let \({\varvec{W}}_{j}^{l} \in {\mathcal {R}}^{O \times 1 \times k \times k}\) represent the weights connected with the jth input channel in layer l; the proximal operator for the exclusive lasso regularizer is defined as:

$$\begin{aligned} prox_{EL}({\varvec{W}}_{j}^{l}) = sign(W_{j,i}^{l})(\left| W_{j,i}^{l}\right| -\lambda \Vert {\varvec{W}}_{j}^{l} \Vert _{1})_{+}. \end{aligned}$$
(6)

In our case, we prune filters based on the condensation criterion; therefore, we do not follow Eq. (6) but stick to the conventional stochastic gradient descent algorithm for optimization in our experiments.
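For completeness, a hedged sketch of the proximal step in Eq. (6), which we do not use in our training, might look as follows; the function name is ours.

```python
import torch

def prox_exclusive_lasso(w_j: torch.Tensor, lam: float) -> torch.Tensor:
    """Shrink the weights attached to one input channel as in Eq. (6)."""
    shrinkage = lam * w_j.abs().sum()                        # lambda * ||W_j^l||_1
    return torch.sign(w_j) * torch.clamp(w_j.abs() - shrinkage, min=0.0)
```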

5 Experiments

5.1 Datasets

CIFAR Both CIFAR-10 (C10) and CIFAR-100 (C100) [24] consist of colored images of \(32\times 32\) pixels. C10 and C100 have 10 and 100 classes, respectively. The training and testing sets contain 50,000 and 10,000 images, respectively. Following [16], we adopt a standard data augmentation scheme, including mirroring, shifting and normalizing the data using the channel means and standard deviations.


Tiny ImageNet The Tiny ImageNet dataset is a subset of ImageNet [5]. It contains 200 categories sampled from the 1000 classes of ImageNet; each class consists of 500 training images, 50 validation images and 50 test images. All images are downsampled to a fixed resolution of \(64\times 64\). For preprocessing, 4 pixels are padded on every side and a \(64\times 64\) crop is randomly sampled from the padded image or its horizontal flip. We normalize the data using the channel means and standard deviations. For testing, we only evaluate the original \(64\times 64\) images.

5.2 Training

We use the following default settings in all our experiments unless otherwise specified. The default growth rates are {8,16,32} on CIFAR and {12,24,48} on Tiny ImageNet. The default condensation factor C and group number G are both 4. Blocks with the same feature map size are said to belong to the same stage. All our networks have three stages, and each stage has the same number of blocks. We choose models of different depths to test the effect of the exclusive lasso regularization on different model scales. Specifically, models with 50, 86, 122 and 182 layers have {8-8-8}, {14-14-14}, {20-20-20} and {30-30-30} blocks in the three stages, respectively.

Following the training schedule in [16], all networks are trained using stochastic gradient descent (SGD). Specifically, we adopt Nesterov momentum with a weight decay of 1e-4 and a momentum of 0.9. We train the models with a batch size of 64 for 300 epochs on all datasets by default. We adopt the weight initialization introduced by [11] and batch normalization [19]. For CondenseNet-182 on CIFAR, we train the model for 600 epochs with a dropout [41] rate of 0.1; CondenseNet-182 on Tiny ImageNet is trained for 300 epochs with a dropout rate of 0.1. We use a cosine learning rate schedule [30] starting from 0.1 and gradually decaying to 0. Following the implementation of CondenseNet, we apply the dropout layer after the batch normalization layer, as suggested by [27], to avoid the “variance shift” phenomenon that occurs when dropout layers are placed before batch normalization layers. We zero out the gradients of the pruned filters during backward propagation. To ensure a fair comparison between our proposed method and the original model, we report re-implemented results of CondenseNets following https://github.com/ShichenLiu/CondenseNet. We use the same random seed for weight initialization when comparing CondenseNets and CondenseNets-elasso. To save GPU memory and fit large models on one GTX 1080Ti, we follow the implementation of memory-efficient DenseNet [36]. To be more specific, we checkpoint the learned group convolution part during training by discarding the intermediate feature maps during the forward pass and recomputing them during the backward pass, at the expense of additional training time.

5.3 Classification results

Results on CIFAR In Table 1, we perform experiments on the CIFAR datasets to validate the effectiveness of our proposed method. Concretely, we compare CondenseNet-elasso with DenseNet, CondenseNet, interleaved group convolution [53] and variants of ResNet and DenseNet [3, 43, 47, 48]. We train CondenseNets and CondenseNets-elasso 3 times and report the mean errors. First, compared with CondenseNets, our proposed method reduces the classification error rate by 0.03%, 0.08%, 0.18% and 0.12% on CIFAR-10 and by 0.52%, 0.38%, 0.25% and 0.34% on CIFAR-100 for models of 50, 86, 122 and 182 layers, respectively. The performance gain on CIFAR-10 becomes larger as the model goes deeper in most cases, while on CIFAR-100 our proposed method achieves a noticeable 0.3725% performance boost on average. Compared with DenseNet-40-60 on CIFAR-100, CondenseNet-182-elasso achieves a 1.79% lower error rate with only 0.28x the FLOPs and 1.03x the parameters.

Moreover, we apply the two recently proposed LAP [34] and Hinge [28] methods to DenseNet for comparison. When comparing our CondenseNet-122-elasso with LAP-DenseNet-122-{8,16,32}, our model achieves 3.04% and 0.57% higher classification accuracy on CIFAR-100 and CIFAR-10, respectively, under the same computation settings. Comparing CondenseNet-122-elasso with Hinge-DenseNet-58-78%pruned, our model achieves 3.86% and 1.67% error rate reductions on CIFAR-100 and CIFAR-10 with 84% and 95% of the FLOPs and parameters. Note that Hinge-DenseNet-58 means an un-pruned DenseNet-58 trained with the original implementation, and 78% pruned means that 78% of the FLOPs are pruned. More detailed training settings for Hinge and LAP are described in Appendix A. The experimental results in Table 1 show that CondenseNets-elasso are more computationally efficient.

Next, to validate the regularization effect of the exclusive lasso term, we plot the training and validation loss during training in Fig. 4. Our proposed model converges faster and yields lower training accuracy but higher test accuracy. This observation suggests that the performance gain comes from reduced overfitting induced by the exclusive lasso regularization.

Table 1 CIFAR: Model performance comparison between our proposed method and models from ResNet’s and DenseNet’s family
Fig. 4

CIFAR100: Training and validation loss comparison of CondenseNet-182 and CondenseNet-182-elasso. Vertical dashed lines show the pruning epochs. The bump at epoch 300 is caused by pruning half of the filters; model performance recovers soon afterwards

Results on Tiny ImageNet In Table 2, we conduct experiments on the more complicated Tiny ImageNet dataset to validate the effectiveness of our proposed method. The results in the table show that CondenseNets-elasso reduce the top-1 error rate by 0.57%, 0.45%, 0.27% and 0.51% and the top-5 error rate by 0.58%, 0.63%, 0.57% and 0.36% on networks of 50, 86, 122 and 182 layers compared with CondenseNets. Moreover, compared with DenseNet-100, CondenseNet-122-elasso attains a 1.62% lower top-1 error with 0.9x the FLOPs and 1.16x the parameters. As for DenseNet-192, CondenseNet-182-elasso achieves a 0.66% lower top-1 error using 0.8x the FLOPs and 1.04x the parameters. When comparing our CondenseNet-50-elasso with LAP-DenseNet-52-{8,16,32}, our model achieves 3.3% (top-1) and 1.1% (top-5) higher classification accuracy with only 62% of the FLOPs and 82% of the parameters. Comparing CondenseNet-122-elasso with Hinge-DenseNet-58-60%pruned, our model achieves 4.46% top-1 and 2.2% top-5 error rate reductions with 1.1x the FLOPs and 83% of the parameters.

Next, a more detailed efficiency comparison of parameters and FLOPs among DenseNet, CondenseNet and CondenseNet-elasso is shown in Fig. 5. From the figure, we can see that CondenseNet-elasso is more efficient in FLOPs and parameters than DenseNet and CondenseNet. To sum up, our model achieves 0.1%, 0.37% and 0.54% error rate reductions on average on CIFAR-10, CIFAR-100 and Tiny ImageNet, respectively. This observation shows that the exclusive lasso regularization term performs better on more complicated datasets, which validates its regularization effect.

Table 2 Tiny ImageNet: Model performance of DenseNets, CondenseNets and CondenseNets-elasso
Fig. 5

Tiny ImageNet: Parameters and FLOPs comparison of DenseNets, CondenseNets and CondenseNets-elasso

5.4 Channel reuse

5.4.1 Overlap statistic

In this section, we validate our assumption that the exclusive lasso penalty encourages different groups to use different subsets of incoming channels. The results are shown in Fig. 6. Suppose that the total number of incoming channels in the lth layer is \(C_{in}^{l}\) and the group number is G. The set of selected channels for the gth group in layer l is denoted as \(Set_{in}^{l,g}\). \(OS_{g_{m},g_{k}}^{l}\) denotes the overlap percentage between group \(g_{m}\) and group \(g_{k}\) in the lth layer, and \(\#\) denotes the cardinality of a set, namely,

$$\begin{aligned} OS_{g_{m},g_{k}}^{l}=\frac{\#(Set_{in}^{l,g_{m}}\bigcap Set_{in}^{l,g_{k}})}{C_{in}^{l}/G} \end{aligned}$$
(7)

As a result, \(OS_{g_{m},g_{k}}^{l}\in [0,1]\). There are \(\left( {\begin{array}{c}G\\ 2\end{array}}\right)\) pairs of groups among the G groups; therefore, the overlap statistic between two groups is defined as:

$$\begin{aligned} OS_{2}^{l} = \frac{2}{G(G-1)}\sum _{m=1}^{G}\sum _{k>m}^{G}OS_{g_{m},g_{k}}^{l} \end{aligned}$$
(8)

Similarly, we can define the overlap statistic among 3 groups as \(OS_{3}^{l}\) and among 4 groups as \(OS_{4}^{l}\).
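As an illustration (not the measurement script used for Fig. 6), the pairwise statistic of Eqs. (7)-(8) can be computed from the selected-channel sets of one layer as follows:

```python
from itertools import combinations

def overlap_statistic(selected_sets, c_in: int) -> float:
    """Average pairwise overlap OS_2 for one layer.

    `selected_sets` holds one set of selected input channels per convolutional group.
    """
    groups = len(selected_sets)
    denom = c_in / groups                                    # C_in^l / G, Eq. (7)
    pair_scores = [len(a & b) / denom for a, b in combinations(selected_sets, 2)]
    return sum(pair_scores) / len(pair_scores)               # average over all pairs, Eq. (8)

# Toy example with G = 4 groups and C_in = 8 incoming channels.
print(overlap_statistic([{0, 1}, {1, 2}, {2, 3}, {4, 5}], c_in=8))  # 1/6
```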

Fig. 6

CIFAR100: Overlap statistics comparison on different models. “M-50” represents models with 50 layers. “elasso-group-2” denotes average overlap statistics between 2 groups in CondenseNets-elasso

In Fig. 6, the three subplots show the overlap statistics in stage 1, stage 2 and stage 3 from left to right. Lines in different colors show the average overlap statistics between 2, 3 and 4 groups. In this experiment, the hyperparameter \(\lambda\) is set to 1e-7, 5e-7 and 1e-6 in the three stages, respectively. The figure first shows that the overlap statistics of CondenseNets-elasso are lower than those of their CondenseNet counterparts in most cases. This result confirms our assumption that the model with the exclusive lasso term tends to use more diverse input channels. Second, the gap between the overlap statistics of CondenseNet and CondenseNet-elasso grows larger as the model goes deeper. One possible explanation is that we use an increasing \(\lambda\) in this experiment; besides, higher-level features are more diverse and may have an intrinsic grouping structure.

5.4.2 Hilbert-Schmidt independence criterion

In this section, we validate our assumption that the exclusive lasso penalty encourages different convolutional groups to learn different features. We use the Hilbert-Schmidt Independence Criterion (HSIC) [8, 23, 32] as a measure of similarity. HSIC was originally proposed as a test statistic for determining whether two sets of variables are independent. Suppose X and Y are two random variables; \(HSIC(X,Y)=0\) implies \(p(x,y)=p(x)p(y)\). The empirical estimator of HSIC is defined as

$$\begin{aligned} HSIC(K,L) = \frac{1}{(n-1)^{2}}tr(KHLH) \end{aligned}$$
(9)

and a normalized version of HSIC is defined as

$$\begin{aligned} CKA(K,L) = \frac{HSIC(K,L)}{\sqrt{HSIC(K,K)HSIC(L,L)}} \end{aligned}$$
(10)

where \(H_{n} = I_{n} - \frac{1}{n} 11^{T}\) represents the centering matrix, and \(K_{ij} = k(x_{i}, x_{j})\) and \(L_{ij} = l(y_{i}, y_{j})\) represent the kernel matrices. In this experiment, we use the Gaussian kernel \(k({\mathbf {x}}_{i}, {\mathbf {x}}_{j}) = \exp (-\Vert {\mathbf {x}}_{i}-{\mathbf {x}}_{j}\Vert _{2}^{2}/2\sigma ^{2})\), where the bandwidth \(\sigma\) is set to the median distance between examples [23].
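A hedged numerical sketch of Eqs. (9)-(10) with this Gaussian kernel is given below. X and Y hold one filter per row, the bandwidth follows the median heuristic described above, and the details may differ from the exact implementation used for Fig. 7.

```python
import numpy as np

def gaussian_kernel(X: np.ndarray) -> np.ndarray:
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    sigma = np.median(np.sqrt(sq_dists[sq_dists > 0]))       # median pairwise distance
    return np.exp(-sq_dists / (2 * sigma ** 2))

def cka(X: np.ndarray, Y: np.ndarray) -> float:
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                      # centering matrix H_n
    K, L = gaussian_kernel(X), gaussian_kernel(Y)
    hsic = lambda A, B: np.trace(A @ H @ B @ H) / (n - 1) ** 2   # Eq. (9)
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))          # Eq. (10)
```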

Fig. 7

HSIC statistics between CondenseNet and CondenseNet-elasso on CIFAR-10, CIFAR-100 and Tiny ImageNet

In this experiment, each output convolutional filter is treated as one sample, and filters in different convolutional groups form different sets. Therefore, there are O samples in total, O/G samples in each set, and each feature is of size R. We calculate the average HSIC statistics between different groups on all datasets. Figure 7 shows the HSIC statistics of CondenseNets and CondenseNets-elasso. For example, on CIFAR-100, the HSIC statistics of CondenseNet-elasso are lower than those of CondenseNet, especially in the third stage of the network. This may be due to the fact that \(\lambda\) is set in an increasing manner across the three stages. Similar results can be found on CIFAR-10 and Tiny ImageNet. This observation validates our assumption that exclusive lasso encourages different groups to learn different features.

5.5 Hyperparameter

In this section, we describe how we choose the hyperparameter \(\lambda\). The training dataset is split into two parts: 80% is used for training, while the other 20% is used for validation. We train the model with each candidate \(\lambda\) and choose the one with the lowest validation error. The main results (Tables 1, 2) are obtained by training on the entire training dataset with the chosen \(\lambda\). Note that all models are trained under the same training-validation split for a fair comparison. We choose \(\lambda\) in an increasing manner across the three stages, since features in lower layers are quite generic while features in higher layers are more discriminative [49, 50]; this design is in line with the “increasing growth rate” across the three stages. Figure 8 shows different settings of \(\lambda\) for CondenseNet-50-elasso on CIFAR-100. The CondenseNet-50 error line in Fig. 8 shows that our model attains lower validation error rates under all tested \(\lambda\), which validates the effectiveness of our proposed method.

Fig. 8

CIFAR100: Choosing the hyperparameter \(\lambda\) for CondenseNet-50-elasso. The line chart shows the mean error rate of three runs with standard deviations. “CondenseNet-50 error” means CondenseNet-50 without the exclusive lasso term, where the translucent pink area shows its standard deviation

The chosen \(\lambda\) values for all experiments are set as follows. On the CIFAR datasets, \(\lambda\) is set to {1e-7,5e-7,1e-6} for all models except CondenseNet-182-elasso. On Tiny ImageNet, \(\lambda\) is set to {1e-7,5e-7,1e-6} for CondenseNet-50-elasso and to {1e-8,5e-8,1e-7} for CondenseNet-86-elasso and CondenseNet-122-elasso. \(\lambda\) takes the value {1e-8,5e-8,1e-7} for CondenseNet-182-elasso on both datasets. Additionally, since we only have one GPU for training, our experimental results could likely be improved with further tuning of \(\lambda\).

5.6 Group convolution comparison

In this section, we compare our model with other group convolution variants, including DenseNet with group convolution and a shuffle operation, DenseNet-aggregated and dynamic grouping convolution [55], to measure the effectiveness of our proposed method. These networks are specially designed to match the parameters and FLOPs of the baseline models. Results are shown in Table 3. All experiments in this section are conducted on CIFAR-100 with growth rates {8-16-32}. Model-52, Model-88 and Model-124 have {8-8-8}, {14-14-14} and {20-20-20} blocks in the three stages. For a fair comparison, all models are trained under the same training schedule and cosine learning rate.

Fig. 9

The basic building blocks of the different network architectures in Sect. 5.6. a DenseNet-G-shuffle, where g denotes the group number. b DenseNet-aggregated, where GR denotes the growth rate. c DenseNet-DGConv, where b controls the model complexity

CondenseNet-light We use CondenseNet-light instead of CondenseNet as the baseline since it has a channel reduction layer, in accordance with the other listed models. \(\lambda\) is set to {1e-7,5e-7,1e-6} as previously defined. Table 3 shows that CondenseNets-light-elasso reduce the classification error rate by 0.85%, 0.51% and 0.1%, respectively, compared with CondenseNets-light. The model with 122 layers does not reduce the top-1 error rate as much as its shallower counterparts. One possible reason is that the reduction layer, which prunes out half of the filters in transition blocks, can be seen as a regularization method and weakens the effect of our proposed regularization term.


DenseNet-G-shuffle DenseNet-G replaces each 1 × 1 convolutional layer and the following 3 × 3 convolutional layer in DenseNet’s block with group convolutions. The group number is set to 4, following CondenseNet. Based on DenseNet-G, we add a channel shuffle operation [54] after the bottleneck layers to help information flow across different groups; the resulting model is denoted as DenseNet-G-shuffle. Figure 9a shows its basic building block. DenseNet-G-shuffle has the same architecture as the converted CondenseNet-elasso and can therefore be seen as a CondenseNet-elasso with a pre-defined group structure trained from scratch.

First, we find that the effectiveness of the shuffle operation decreases as the model goes deeper. The results in Table 3 show that DenseNets-G-shuffle achieve 0.92%, 0.21% and 0.04% error rate reductions compared with DenseNets-G on the three models. One possible explanation is that deeper DenseNets have more fused features, so information flow between convolutional groups becomes less effective. Second, our model achieves noticeable 1.15%, 1.84% and 1.51% classification error rate reductions on networks with 52, 88 and 124 layers, which confirms the validity of our proposed method.


DenseNet-aggregated Following ResNeXt’s [47] idea, we design a “ResNeXt-like” DenseNet by setting the cardinality to \(2\times\) the growth rate, while the input and output channel numbers are set to \(8\times\) the growth rate. This design aims to generate models with comparable parameters and FLOPs. The basic building block is shown in Fig. 9b. Results in Table 3 show that our model achieves 0.49%, 1.07% and 1.27% lower error rates compared with DenseNet-aggregated, with 0.88x, 0.82x and 0.76x the parameters and about 0.77x the FLOPs. The error rate gaps between these two sets of models become larger as the model goes deeper.


DenseNet-DGConv Dynamic grouping convolution (DGConv) [55] can automatically learn the optimal grouping strategy and group number in each layer through learnable binary relationship matrices. To compare the performance of learned group convolution and dynamic grouping convolution, we replace the bottleneck layer in DenseNet with DGConv. The resulting model is DenseNet-DGConv, and its basic building block is shown in Fig. 9c.

The original DGConv is inserted into ResNeXt; here, we make a small modification to apply DGConv to CondenseNet. Suppose the input channel number is \(C^{l}_{in}\) and the output channel number is \(C^{l}_{out}\) in the lth layer. The relationship matrix U is a Kronecker product of 2 × 2 matrices; therefore, \(C^{l}_{in}\) and \(C^{l}_{out}\) are powers of 2 by default. In DenseNet, \(C^{l}_{out}\) equals {32, 64, 128} in the three stages, which meets this prerequisite. In ResNeXt, \(C^{l}_{in}\) equals \(C^{l}_{out}\) by default, while in DenseNet, \(C^{l}_{in}\ge C^{l}_{out}\) in most blocks. This case is handled by a variation of the GroupDown method in the appendix of that paper. Specifically, suppose \(C^{l}_{in} = r*C^{l}_{out}+m\), where r is an integer; we first construct \(\tilde{{\varvec{U}}}^{l}\in \{0,1\}^{C^{l}_{out}\times {C^{l}_{out}}}\). The relationship matrix \(\varvec{U}^{l}\) can then be computed as:

$$\begin{aligned} \varvec{U}^{l}=\tilde{{\varvec{I}}}^{l}_{d}\tilde{{\varvec{U}}}^{l}, \quad \tilde{{\varvec{I}}}^{l}_{d}=[\varvec{I}^{l}_{out},...,\varvec{I}^{l}_{out},\varvec{I}^{l}_{m}] \end{aligned}$$
(11)

where \(\tilde{{\varvec{I}}}^{l}_{d}\in \{0,1\}^{C^{l}_{in}\times {C^{l}_{out}}}\) is a matrix concatenated from identity matrices \(\varvec{I}^{l}_{out}\in \{0,1\}^{C^{l}_{out}\times {C^{l}_{out}}}\) and the remainder \(\varvec{I}^{l}_{m}=\varvec{I}^{l}_{out}[:m,:,:,:]\), which takes the first m filters of \(\varvec{I}^{l}_{out}\). When \(C^{l}_{in}<C^{l}_{out}\), we use standard group convolutions with a group number of 4. Parameters and FLOPs of DenseNet-DGConv are calculated through the “gate” parameters in the final model, which is equivalent to a pruning ratio taking values of the form \(1/2^{n}\) (n is a positive integer). The hyperparameter measuring model complexity in that paper is denoted as b. We tried different model complexity settings b from {2, 4, 8, 16, 32} and picked the one with parameters and FLOPs comparable to CondenseNet-elasso. Results in Table 3 show that CondenseNets-elasso achieve 3.64%, 2.19% and 1.02% lower error rates compared with their DenseNet-DGConv counterparts, with 0.43x, 0.89x and 0.63x the parameters and 0.41x, 0.84x and 0.56x the FLOPs.
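As an illustration of Eq. (11) (an assumed reconstruction, not code from [55] or our training scripts), the down-sizing matrix can be built by stacking identity blocks:

```python
import numpy as np

def build_identity_stack(c_in: int, c_out: int) -> np.ndarray:
    """Construct I_d of shape (C_in, C_out) for C_in = r * C_out + m."""
    r, m = divmod(c_in, c_out)
    blocks = [np.eye(c_out)] * r
    if m > 0:
        blocks.append(np.eye(c_out)[:m, :])   # the truncated remainder block I_m
    return np.vstack(blocks)

# U^l = build_identity_stack(c_in, c_out) @ U_tilde, with U_tilde of size (C_out, C_out)
```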

Table 3 CIFAR100: Performance comparison of CondenseNets-light-elasso with other DenseNet group convolution counterparts

5.7 Discussion

In this section, we summarize the experiments conducted to validate the effectiveness of our proposed method. First, Sect. 5.3 presents our main results on CIFAR-100, CIFAR-10 and Tiny ImageNet, as well as a parameter/FLOPs comparison between different models, and Sect. 5.5 analyzes how we choose the hyperparameter \(\lambda\). Second, in Sect. 5.4, we validate our assumption that exclusive lasso encourages different convolutional groups to use different subsets of input channels, through the overlap statistic, and to learn more diversified features, through the HSIC statistic. Third, we compare our proposed method with group convolution variants in Sect. 5.6, covering the effect of the shuffle operation, increased cardinality and dynamic grouping convolution, which are designed to learn more efficient group convolutions. None of the evaluated methods is as efficient as CondenseNet-elasso under similar computation settings, which validates the effectiveness of our proposed method.

Our method applies to scenarios where the network backbone is DenseNet or stacks of dense blocks, especially deep convolutional networks. Still, there may be some limitations in this study: our proposed approach assumes that diversified features help to boost performance. There are other works [4, 58] on decorrelating features in neural networks. The experimental results in Figs. 6 and 7 show that our model learns more diversified features and outperforms the other reported methods; however, this assumption needs to be further validated and explored. If this assumption holds, designing new methods to decorrelate features in neural networks can help to build compact models and save computation.

6 Conclusion

In this paper, we insert the exclusive lasso penalty into CondenseNet to encourage different convolutional groups to learn less correlated features. In our experiments, CondenseNets-elasso achieve a noticeable performance boost compared with CondenseNets and other group convolution variants under similar computation budgets on three public datasets. The experimental results validate our assumption that the regularizer helps different groups to use different subsets of incoming channels and to learn more diversified features.