Minimum-risk temporal alignment of videos

Multimedia Tools and Applications

Abstract

Temporal alignment of videos is an important requirement of tasks such as video comparison, analysis and classification. Most approaches proposed to date for video alignment rely on dynamic programming algorithms whose parameters are tuned manually. In contrast, this paper proposes a model that learns its parameters automatically by minimizing a meaningful loss function over a given training set of videos and alignments. For learning, we exploit the effective framework of the structural SVM and extend it with an original scoring function that suitably scores the alignment of two given videos, and a loss function that quantifies the accuracy of a predicted alignment. Experimental results on four video action datasets show that the proposed model outperforms a baseline and a state-of-the-art algorithm by a large margin in terms of alignment accuracy.
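For readers unfamiliar with the dynamic programming approach mentioned above, the following is a minimal sketch of classic dynamic time warping (DTW) applied to two sequences of per-frame feature vectors. It is an illustration only, not the model proposed in this paper: the function name `dtw_align`, the Euclidean frame distance and the three-step move set are assumptions chosen for the example, and they are exactly the kind of hand-picked settings that a learned model would replace.

```python
import numpy as np


def dtw_align(x, y):
    """Align two feature sequences with classic dynamic time warping.

    x: (n, d) array-like, y: (m, d) array-like of per-frame features.
    Returns the total alignment cost and the warping path as a list of
    (i, j) pairs, meaning frame i of x is matched to frame j of y.
    Illustrative sketch: the Euclidean distance is an assumed choice.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    # Pairwise Euclidean distance between every frame of x and of y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # acc[i, j] = minimum cost of aligning x[:i] with y[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance in both videos
                acc[i - 1, j],      # advance in x only
                acc[i, j - 1],      # advance in y only
            )

    # Backtrack from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return float(acc[n, m]), path
```

Calling `dtw_align(features_a, features_b)` on two arrays of frame descriptors returns the cost and the frame-to-frame correspondence. Nothing in this sketch is learned from data, which is the limitation the abstract contrasts with the proposed approach.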

Author information

Corresponding author

Correspondence to Zhen Wang.

About this article

Cite this article

Wang, Z., Piccardi, M. Minimum-risk temporal alignment of videos. Multimed Tools Appl 77, 14891–14906 (2018). https://doi.org/10.1007/s11042-017-5073-3

  • DOI: https://doi.org/10.1007/s11042-017-5073-3
