DOI: 10.1145/3589334.3645677
Research article (Free Access)

MMLSCU: A Dataset for Multi-modal Multi-domain Live Streaming Comment Understanding

Published: 13 May 2024

ABSTRACT

With the increasing popularity of live streaming, viewer interactions during a live stream can provide specific and constructive feedback for both the streamer and the platform. In such scenarios, the primary and most direct form of audience feedback is the comment. Mining live streaming comments to uncover the intentions behind them, and in turn helping streamers improve their streaming quality, is therefore important for the healthy development of the live streaming ecosystem. To this end, we introduce the MMLSCU dataset, containing 50,129 intention-annotated comments across multiple modalities (text, images, videos, audio) from eight streaming domains. Using a multimodal pretrained large model and drawing inspiration from the Chain of Thought (CoT) concept, we implement an end-to-end model that sequentially performs four tasks: viewer comment intent detection ➛ intent cause mining ➛ viewer comment explanation ➛ streamer policy suggestion. We employ distinct branches to process the video and audio modalities. After obtaining the video and audio representations, we fuse them with the comment; the integrated representation is then fed into the large language model, which performs inference across the four tasks following the CoT framework. Experimental results show that our model outperforms three multimodal classification baselines on comment intent detection and streamer policy suggestion, and one multimodal generation baseline on intent cause mining and viewer comment explanation. Compared with text-only models, our multimodal setting yields superior results. Moreover, incorporating CoT enables our model to produce better comment interpretations and more precise suggestions for streamers. We expect the proposed dataset and model to draw new research attention to multimodal live streaming comment understanding.
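The four-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' released code: all function names, embedding dimensions, the concatenation-based fusion, and the placeholder stage outputs are assumptions standing in for the paper's pretrained encoders and LLM decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames, dim=256):
    # Stand-in for the dedicated pretrained video branch.
    return rng.standard_normal(dim)

def encode_audio(waveform, dim=256):
    # Stand-in for the dedicated pretrained audio branch.
    return rng.standard_normal(dim)

def encode_comment(text, dim=256):
    # Stand-in for a text encoder over the viewer comment.
    return rng.standard_normal(dim)

def fuse(video_vec, audio_vec, comment_vec):
    # Simple late fusion by concatenation; the exact fusion
    # operator used in the paper is not specified here.
    return np.concatenate([video_vec, audio_vec, comment_vec])

def cot_infer(fused):
    # CoT-style sequential inference: each of the four tasks
    # conditions on the fused input plus all earlier stage outputs.
    stages = ["intent detection", "intent cause mining",
              "comment explanation", "policy suggestion"]
    outputs = []
    for stage in stages:
        context = (fused, tuple(outputs))  # prior steps carried forward
        outputs.append(f"{stage} result")  # placeholder for LLM decoding
    return outputs

fused = fuse(encode_video(None), encode_audio(None), encode_comment("GG!"))
print(len(fused), cot_infer(fused))
```

The key design choice mirrored here is that the four tasks are not independent heads: each later stage (e.g., policy suggestion) receives the reasoning produced by the earlier stages, which is what the CoT framing contributes over a flat multi-task setup.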

Supplemental Material

rfp2166.mp4 (supplemental video, mp4, 24.7 MB)


Published in

WWW '24: Proceedings of the ACM on Web Conference 2024
May 2024, 4826 pages
ISBN: 9798400701719
DOI: 10.1145/3589334
          Copyright © 2024 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States



Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions (23%)