research-article

ASTRA: An Action Spotting TRAnsformer for Soccer Videos

Authors:
Artur Xarles
Universitat de Barcelona & Computer Vision Center, Barcelona, Spain
ORCID: 0000-0002-0800-0270

Sergio Escalera
Universitat de Barcelona & Computer Vision Center & Aalborg University, Barcelona, Spain
ORCID: 0000-0003-0617-8873

Thomas B. Moeslund
Aalborg University, Aalborg, Denmark
ORCID: 0000-0001-7584-5209

Albert Clapés
Universitat de Barcelona & Computer Vision Center, Barcelona, Spain
ORCID: 0000-0002-4089-9060

MMSports '23: Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, October 2023, Pages 93–102
https://doi.org/10.1145/3606038.3616153

Published: 29 October 2023

ABSTRACT

In this paper, we introduce ASTRA, a Transformer-based model for action spotting in soccer matches. ASTRA addresses several challenges inherent in the task and dataset: the need for precise action localization, a long-tail data distribution, actions that are not visible in the footage, and inherent label noise. To this end, ASTRA incorporates (a) a Transformer encoder-decoder architecture that achieves the desired output temporal resolution and produces precise predictions, (b) a balanced mixup strategy to handle the long-tail distribution of the data, (c) an uncertainty-aware displacement head to capture label variability, and (d) an audio input signal to improve detection of non-visible actions. Results demonstrate the effectiveness of ASTRA, which achieves a tight Average-mAP of 66.82 on the test set. Moreover, in the SoccerNet 2023 Action Spotting challenge, we secured 3rd place with an Average-mAP of 70.21 on the challenge set.
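
The abstract names a balanced mixup strategy for the long-tail distribution but does not describe how it is built. As an illustration only, the sketch below assumes one plausible reading: standard mixup (a convex combination of two samples and their one-hot labels, with the mixing weight drawn from a Beta distribution) where the second sample is drawn with probability inversely proportional to its class frequency, so rare classes appear as mixing partners as often as frequent ones. The function names, the `alpha` value, and the sampling scheme are assumptions for this sketch, not the authors' implementation.

```python
import random
from collections import Counter


def balanced_partner_weights(labels):
    """Weight each sample by the inverse frequency of its class.

    The weights of every class sum to 1, so a weighted draw picks
    each class with equal probability (assumed balancing scheme).
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def one_hot(label, num_classes):
    """Return a one-hot list for an integer class label."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def balanced_mixup(features, labels, num_classes, alpha=0.2, rng=None):
    """Mix every sample with a partner drawn from a class-balanced distribution.

    features: list of equal-length lists of floats (hypothetical clip features)
    labels:   list of integer class labels
    alpha:    Beta(alpha, alpha) parameter for the mixing weight (assumed value)
    """
    rng = rng or random.Random()
    weights = balanced_partner_weights(labels)
    indices = range(len(labels))
    mixed_x, mixed_y = [], []
    for x, y in zip(features, labels):
        # Class-balanced choice of the mixing partner.
        j = rng.choices(indices, weights=weights, k=1)[0]
        # Mixing coefficient in [0, 1].
        lam = rng.betavariate(alpha, alpha)
        mixed_x.append([lam * a + (1.0 - lam) * b
                        for a, b in zip(x, features[j])])
        ya = one_hot(y, num_classes)
        yb = one_hot(labels[j], num_classes)
        mixed_y.append([lam * a + (1.0 - lam) * b for a, b in zip(ya, yb)])
    return mixed_x, mixed_y
```

With four samples of one class and one sample of another, the single rare-class sample receives weight 1.0 while each frequent-class sample receives 0.25, so both classes are equally likely to be chosen as the mixing partner.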

Index Terms

ASTRA: An Action Spotting TRAnsformer for Soccer Videos
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Video segmentation
      2. Computer vision tasks
        Activity recognition and understanding

Published in
MMSports '23: Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports
October 2023
174 pages
ISBN: 9798400702693
DOI: 10.1145/3606038
Program Chairs: Rainer Lienhart (University of Augsburg), Thomas B. Moeslund (Aalborg University), Hideo Saito (Keio University)
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
action spotting
balanced mixup
computer vision
transformer encoder-decoder
uncertainty estimation
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 29 of 49 submissions, 59%