DOI: 10.1145/3394486.3403165

Retrospective Loss: Looking Back to Improve Training of Deep Neural Networks

Published: 20 August 2020

ABSTRACT

Deep neural networks (DNNs) are powerful learning machines that have enabled breakthroughs in several domains. In this work, we introduce a new retrospective loss to improve the training of deep neural network models by utilizing the prior experience available in past model states during training. Minimizing the retrospective loss, along with the task-specific loss, pushes the parameter state at the current training step towards the optimal parameter state while pulling it away from the parameter state at a previous training step. Although the idea is simple, we both analyze the method and conduct comprehensive experiments across domains - images, speech, text, and graphs - showing that the proposed loss improves performance across input domains, tasks, and architectures.
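The training recipe the abstract describes lends itself to a short sketch: a two-term objective whose first term pulls the current predictions toward the ground truth and whose second term pushes them away from the predictions of a frozen copy of the model from an earlier training step. The PyTorch code below is a minimal sketch of that idea only; the L1 distance, the (kappa + 1)/kappa weighting, and the warm-up and refresh schedule (retrospective_loss, past_model, warmup_steps, update_period) are illustrative assumptions, not the paper's exact formulation.

```python
import copy
import torch
import torch.nn.functional as F

def retrospective_loss(current_out, past_out, target, kappa=2.0):
    # Pull current predictions toward the target (a stand-in for the optimal state) ...
    pull = F.l1_loss(current_out, target)
    # ... while pushing them away from the predictions of a frozen past model state.
    push = F.l1_loss(current_out, past_out)
    return (kappa + 1.0) * pull - kappa * push

# Toy regression setup; the names and schedule below are illustrative assumptions.
model = torch.nn.Linear(16, 4)
past_model = copy.deepcopy(model)          # frozen copy of an earlier parameter state
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = [(torch.randn(8, 16), torch.randn(8, 4)) for _ in range(200)]
warmup_steps, update_period = 50, 20

for step, (x, y) in enumerate(loader):
    out = model(x)
    loss = F.l1_loss(out, y)               # task-specific loss
    if step >= warmup_steps:               # add the retrospective term after a warm-up
        with torch.no_grad():
            past_out = past_model(x)
        loss = loss + retrospective_loss(out, past_out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (step + 1) % update_period == 0:    # periodically refresh the past state
        past_model.load_state_dict(model.state_dict())
```

Note that the past state must lag the current one: refreshing past_model at every step would make the push term vanish, so the sketch updates it only every update_period steps.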


Published in

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020, 3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


        Qualifiers

        • research-article

Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions, 13%
