Personalized time-sync comment generation based on a multimodal transformer

Wang, Hei-Chia; Maslim, Martinus; Hong, Wei-Ting

doi:10.1007/s00530-024-01301-3

Special Issue Paper
Published: 30 March 2024

Multimedia Systems, volume 30, article number 105 (2024)

Abstract

Online video entertainment attracts large audiences and sustained viewing across many fields. With more than 4.5 billion Internet users worldwide, it remains the most popular online activity. Time-sync comments (TSCs) are a new type of textual information in videos. Unlike comments on traditional video-sharing platforms, which users can only leave in a separate comments section, TSCs "fly across" the screen at specific points in video playback. However, current research on TSC generation focuses only on the relationship between the image and TSC modalities and does not address personalization. We therefore propose personalized time-sync comment generation (PTSCG), a multimodal transformer that generates TSCs better suited to individual users. In our experiments, the F1 score of PTSCG, computed by comparing the generated TSCs with the original TSCs, reached 0.58, which is higher than that of existing models and demonstrates the effectiveness of the proposed method.
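The abstract reports an F1 score obtained by comparing generated TSCs with the original ones, but it does not say how that score is computed. As a rough illustration only, the sketch below assumes a simple token-overlap F1 (the harmonic mean of precision and recall over whitespace-separated tokens). The function name `token_f1` and the example comments are hypothetical, and the metric used in the paper may differ, for example an embedding-based score.

```python
from collections import Counter


def token_f1(generated: str, reference: str) -> float:
    """Token-overlap F1 between a generated and a reference comment.

    Assumption: whitespace tokenization and exact token matching.
    This is an illustrative metric, not necessarily the paper's.
    """
    gen_tokens = generated.split()
    ref_tokens = reference.split()
    if not gen_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears in both comments.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(gen_tokens)  # share of generated tokens that match
    recall = overlap / len(ref_tokens)     # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical generated and original comments.
    score = token_f1("this scene is so funny", "this scene is funny")
    print(f"F1 = {score:.3f}")
```

Under this definition, a corpus-level score would be the average of `token_f1` over all generated and original comment pairs. Chinese TSCs would also need a word or character segmenter in place of `str.split`.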

Data availability

Data sharing does not apply to this article, as no datasets were generated or analyzed during the current study.

Funding

This research is based on work supported by the Taiwan Ministry of Science and Technology under Grant Nos. MOST 107-2410-H-006 040-MY3 and MOST 108-2511-H-006-009. The authors would like to acknowledge the partial research grant supported by the "Higher Education SPROUT Project" and the "Center for Innovative FinTech Business Models" of National Cheng Kung University (NCKU), sponsored by the Ministry of Education, Taiwan.

Author information

Authors and Affiliations

Institute of Information Management, College of Management, National Cheng Kung University, Tainan City, Taiwan
Hei-Chia Wang, Martinus Maslim & Wei-Ting Hong
Informatics Department, Faculty of Industrial Technology, Universitas Atma Jaya Yogyakarta, Daerah Istimewa Yogyakarta, Indonesia
Martinus Maslim
Center for Innovative FinTech Business Models, National Cheng Kung University, Tainan City, Taiwan
Hei-Chia Wang

Contributions

Hei-Chia Wang conceived the research, Wei-Ting Hong performed the experiment, and Martinus Maslim and Wei-Ting Hong wrote the manuscript. Hei-Chia Wang reviewed the manuscript.

Corresponding author

Correspondence to Hei-Chia Wang.

Ethics declarations

Conflict of interest

The authors of this study declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, HC., Maslim, M. & Hong, WT. Personalized time-sync comment generation based on a multimodal transformer. Multimedia Systems 30, 105 (2024). https://doi.org/10.1007/s00530-024-01301-3

Received: 24 November 2022
Accepted: 21 February 2024
Published: 30 March 2024
DOI: https://doi.org/10.1007/s00530-024-01301-3
