Abstract
Temporal alignment of videos is an important requirement for tasks such as video comparison, analysis and classification. Most of the video alignment approaches proposed to date rely on dynamic programming algorithms whose parameters are tuned manually. In contrast, this paper proposes a model that learns its parameters automatically by minimizing a meaningful loss function over a given training set of videos and alignments. For learning, we adopt the structural SVM framework and extend it with an original scoring function that scores the alignment of two given videos, and a loss function that quantifies the accuracy of a predicted alignment. Experimental results on four video action datasets show that the proposed model outperforms both a baseline and a state-of-the-art algorithm by a large margin in terms of alignment accuracy.
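For readers unfamiliar with dynamic-programming alignment, the following is a minimal sketch of classic dynamic time warping (DTW), the kind of hand-specified method the abstract contrasts with a learned model. It is not the model proposed in this paper. The function name `dtw_align`, the Euclidean frame cost, and the three allowed moves are illustrative assumptions.

```python
import numpy as np


def dtw_align(x, y):
    """Align two feature sequences with classic dynamic time warping.

    x has shape (n, d) and y has shape (m, d), one row per frame.
    Returns (total_cost, path), where path is a list of (i, j) frame pairs.
    Illustrative sketch only: the frame cost is fixed by hand, not learned.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    # Pairwise Euclidean distance between frames (an assumed cost function).
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # acc[i, j] is the minimum cost of aligning x[:i] with y[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance in both sequences
                acc[i - 1, j],      # advance in x only
                acc[i, j - 1],      # advance in y only
            )

    # Backtrack from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(moves, key=lambda t: t[0])
    path.reverse()
    return acc[n, m], path
```

In this sketch the frame cost and the set of allowed moves are chosen by hand, which is the kind of manual tuning that a learned scoring function is meant to replace.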
Wang, Z., Piccardi, M. Minimum-risk temporal alignment of videos. Multimed Tools Appl 77, 14891–14906 (2018). https://doi.org/10.1007/s11042-017-5073-3
DOI: https://doi.org/10.1007/s11042-017-5073-3