Abstract
Learning to capture both long-range and short-range temporal information is crucial for action recognition task. Previous works utilize 3D ConvNets to capture short-range temporal dynamics in replacement of optical-flow which needs time-consuming extraction. However, dramatically incresed parameters limit the capacity for modeling long-term interactions. In this paper, we propose Segments-based 3D ConvNet (S3D) to integrate both long-term and short-term temporal dynamics. Firstly, we utilize 3D ResNet without temporal downsampling to capture short-range video contents. Secondly, we integrate a sparse sampling strategy to model long-range temporal structure. Finally, experiments on UCF-101 and HMDB-51 datasets show the effectiveness of our S3D compared with corresponding 3D ConvNet.
Export citation and abstract BibTeX RIS
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.