
Deep network compression with teacher latent subspace learning and LASSO


Abstract

Deep neural networks have been shown to excel at understanding multimedia by using latent representations to learn complex and useful abstractions. However, they remain impractical for embedded devices due to memory constraints, high latency, and considerable power consumption at runtime. In this paper, we propose compressing deep models by learning lower-dimensional subspaces from their latent representations while incurring only a minimal loss of performance. We leverage the premise that deep convolutional neural networks extract many redundant features to learn new subspaces for feature representation. We construct a compressed (student) model by reconstructing the representations captured by an already trained large (teacher) model. Unlike state-of-the-art approaches, the proposed approach does not rely on labeled data. Moreover, it allows the sparsity-inducing LASSO parameter penalty to achieve better compression results than when it is used to train models from scratch. We perform extensive experiments using VGG-16 and wide ResNet models on the CIFAR-10, CIFAR-100, MNIST and SVHN datasets. For instance, a VGG-16 model with 8.96M parameters trained on CIFAR-10 was pruned by 81.03% with only a 0.26% loss in generalization performance. Correspondingly, its size is reduced from 35 MB to 6.72 MB, facilitating compact storage. Furthermore, its inference time is reduced from 1.1 s to 0.6 s, accelerating inference. Notably, the proposed student models outperform state-of-the-art approaches and the same models trained from scratch.
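To make the described approach concrete, below is a minimal sketch, in PyTorch, of learning a lower-dimensional subspace from a teacher's latent representations with a sparsity-inducing LASSO (L1) penalty. The layer sizes, the encode/decode projections, the penalty strength, and the random stand-in for the teacher activations are all illustrative assumptions and are not taken from the paper, which trains on representations captured from an already trained teacher network.

    import torch
    import torch.nn as nn

    # Hypothetical sizes: a 512-dimensional teacher latent representation is
    # compressed into a 128-dimensional student subspace.
    teacher_dim, student_dim, batch_size = 512, 128, 64
    lasso_lambda = 1e-4  # illustrative strength of the L1 (LASSO) penalty

    encode = nn.Linear(teacher_dim, student_dim)  # projects teacher features onto the learned subspace
    decode = nn.Linear(student_dim, teacher_dim)  # reconstructs teacher features from the subspace
    optimizer = torch.optim.Adam(list(encode.parameters()) + list(decode.parameters()), lr=1e-3)

    # Stand-in for latent activations captured from an already trained teacher model;
    # note that no class labels are required, only the teacher's representations.
    teacher_latent = torch.randn(batch_size, teacher_dim)

    for step in range(200):
        optimizer.zero_grad()
        reconstruction = decode(encode(teacher_latent))
        # Reconstruction loss against the teacher's latent representation ...
        loss = nn.functional.mse_loss(reconstruction, teacher_latent)
        # ... plus the sparsity-inducing LASSO penalty on the student parameters.
        loss = loss + lasso_lambda * sum(p.abs().sum() for p in encode.parameters())
        loss.backward()
        optimizer.step()

In this sketch the L1 term drives many entries of the student projection toward zero, which is what enables pruning of the compressed model.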



Notes

  1. Mapping from input data space to output (softmax) space

  2. Mapping from a hypothetical hidden layer to the adjoining one

  3. LASSO and L1-norm are used interchangeably (the penalty is written out after this list)

  4. Code will be made publicly available upon paper acceptance
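For reference, the LASSO penalty mentioned in note 3 is the standard L1 regularizer added to a training objective; in the illustrative notation below, $\mathcal{L}(w)$ is any base loss, $w$ the model parameters, and $\lambda$ the penalty strength (these symbols are for exposition and are not taken from the paper):

  $\mathcal{L}_{\mathrm{total}}(w) = \mathcal{L}(w) + \lambda \lVert w \rVert_1, \qquad \lVert w \rVert_1 = \textstyle\sum_i \lvert w_i \rvert$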


Author information

Corresponding author

Correspondence to Oyebade K. Oyedotun.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was funded by the National Research Fund (FNR), Luxembourg, under the project references R-AGR-0424-05-D/Björn Ottersten and CPPP17/IS/11643091/IDform/Aouada.


Cite this article

Oyedotun, O.K., Shabayek, A.E.R., Aouada, D. et al. Deep network compression with teacher latent subspace learning and LASSO. Appl Intell 51, 834–853 (2021). https://doi.org/10.1007/s10489-020-01858-2

