ABSTRACT
With the increasing popularity of live streaming, viewer interactions during a live stream can provide specific and constructive feedback for both the streamer and the platform. In such a scenario, the primary and most direct form of audience feedback is comments. Mining these live streaming comments to uncover the intentions behind them, and in turn helping streamers improve their live streaming quality, is therefore important for the healthy development of the live streaming ecosystem. To this end, we introduce the MMLSCU dataset, containing 50,129 intention-annotated comments across multiple modalities (text, images, videos, audio) from eight streaming domains. Building on a large multimodal pretrained model and drawing inspiration from the Chain of Thought (CoT) concept, we implement an end-to-end model that sequentially performs four tasks: viewer comment intent detection ➛ intent cause mining ➛ viewer comment explanation ➛ streamer policy suggestion. We employ distinct branches to process the video and audio modalities. After obtaining the video and audio representations, we fuse them with the comment representation, and the fused input is fed into the large language model, which performs inference across the four tasks following the CoT framework. Experimental results show that our model outperforms three multimodal classification baselines on comment intent detection and streamer policy suggestion, and one multimodal generation baseline on intent cause mining and viewer comment explanation. Compared to models using only text, our multimodal setting yields superior results. Moreover, incorporating CoT enables our model to produce better comment interpretations and more precise suggestions for streamers. We expect our dataset and model to draw new research attention to multimodal live streaming comment understanding.
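The four-stage chained inference described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' actual implementation: the task wording, placeholder modality representations, and function names are all illustrative assumptions; the only detail taken from the abstract is the fixed ordering of the four tasks, each conditioning on the previous steps.

```python
# Hypothetical sketch of the four-task CoT prompt chaining described in the
# abstract. Task phrasing and the [VIDEO]/[AUDIO]/[COMMENT] placeholder format
# are illustrative assumptions, not the paper's actual prompt design.

TASKS = [
    "viewer comment intent detection",
    "intent cause mining",
    "viewer comment explanation",
    "streamer policy suggestion",
]

def build_cot_prompt(comment: str, video_repr: str, audio_repr: str) -> str:
    """Chain the four tasks so each step conditions on the earlier outputs."""
    header = (
        f"[VIDEO] {video_repr}\n"
        f"[AUDIO] {audio_repr}\n"
        f"[COMMENT] {comment}\n"
    )
    # Each step explicitly references the results of the preceding steps,
    # mirroring the sequential CoT ordering given in the abstract.
    steps = "\n".join(
        f"Step {i + 1}: Perform {task}, using the results of the previous steps."
        for i, task in enumerate(TASKS)
    )
    return header + steps

prompt = build_cot_prompt(
    comment="Your aim is terrible today!",
    video_repr="<fused video features>",
    audio_repr="<fused audio features>",
)
print(prompt)
```

In the full model, the bracketed placeholders would be replaced by the fused video, audio, and comment representations before being passed to the language model.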
Index Terms
- MMLSCU: A Dataset for Multi-modal Multi-domain Live Streaming Comment Understanding