ABSTRACT
With the increasing popularity of live streaming, viewer interactions during a live stream can provide specific and constructive feedback for both the streamer and the platform. In such a scenario, the primary and most direct form of audience feedback is comments. Mining these live streaming comments to uncover the intentions behind them, and in turn helping streamers improve their live streaming quality, is therefore important for the healthy development of the live streaming ecosystem. To this end, we introduce the MMLSCU dataset, containing 50,129 intention-annotated comments across multiple modalities (text, images, videos, audio) from eight streaming domains. Building on a large multimodal pretrained model and drawing inspiration from the Chain of Thought (CoT) concept, we implement an end-to-end model that sequentially performs four tasks: viewer comment intent detection ➛ intent cause mining ➛ viewer comment explanation ➛ streamer policy suggestion. We employ distinct branches to process the video and audio modalities. After obtaining the video and audio representations, we fuse them with the comment representation, and the fused input is fed into the large language model, which performs inference across the four tasks following the CoT framework. Experimental results show that our model outperforms three multimodal classification baselines on comment intent detection and streamer policy suggestion, and one multimodal generation baseline on intent cause mining and viewer comment explanation. Compared to models using only text, our multimodal setting yields superior results. Moreover, incorporating CoT enables our model to produce better comment interpretations and more precise suggestions for streamers. We expect our dataset and model to draw new research attention to multimodal live streaming comment understanding.
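The four-stage chained inference described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' actual implementation: the task wording, placeholder modality representations, and function names are all illustrative assumptions; the only detail taken from the abstract is the fixed ordering of the four tasks, each conditioning on the previous steps.

```python
# Hypothetical sketch of the four-task CoT prompt chaining described in the
# abstract. Task phrasing and the [VIDEO]/[AUDIO]/[COMMENT] placeholder format
# are illustrative assumptions, not the paper's actual prompt design.

TASKS = [
    "viewer comment intent detection",
    "intent cause mining",
    "viewer comment explanation",
    "streamer policy suggestion",
]

def build_cot_prompt(comment: str, video_repr: str, audio_repr: str) -> str:
    """Chain the four tasks so each step conditions on the earlier outputs."""
    header = (
        f"[VIDEO] {video_repr}\n"
        f"[AUDIO] {audio_repr}\n"
        f"[COMMENT] {comment}\n"
    )
    # Each step explicitly references the results of the preceding steps,
    # mirroring the sequential CoT ordering given in the abstract.
    steps = "\n".join(
        f"Step {i + 1}: Perform {task}, using the results of the previous steps."
        for i, task in enumerate(TASKS)
    )
    return header + steps

prompt = build_cot_prompt(
    comment="Your aim is terrible today!",
    video_repr="<fused video features>",
    audio_repr="<fused audio features>",
)
print(prompt)
```

In the full model, the bracketed placeholders would be replaced by the fused video, audio, and comment representations before being passed to the language model.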
Index Terms
- MMLSCU: A Dataset for Multi-modal Multi-domain Live Streaming Comment Understanding