Abstract
Online video entertainment attracts large audiences and sustained viewing across many fields. With more than 4.5 billion Internet users worldwide, online video entertainment continues to be the most popular activity for users. Time synchronization comments (TSCs) are a new type of textual information in videos. On traditional online video-sharing platforms, users can only leave comments in a separate comments section; TSCs, by contrast, "fly through" the screen at the playback time to which they are attached. However, current research on TSC generation focuses only on the relationship between the image and TSC modalities and does not address personalization. We therefore propose a multimodal transformer, personalized time-sync comment generation (PTSCG), which generates personalized TSCs that are better suited to different users. In our experiments, comparing the generated TSCs with the original TSCs gave PTSCG an F1 score of 0.58, which is higher than those of other existing models and shows the effectiveness of the proposed method.
Data availability
Data sharing does not apply to this article, as no datasets were generated or analyzed during the current study.
Funding
This research is based on work supported by the Taiwan Ministry of Science and Technology under Grant Nos. MOST 107-2410-H-006-040-MY3 and MOST 108-2511-H-006-009. The authors also acknowledge partial research support from the "Higher Education SPROUT Project" and the "Center for Innovative FinTech Business Models" of National Cheng Kung University (NCKU), sponsored by the Ministry of Education, Taiwan.
Author information
Contributions
Hei-Chia Wang conceived the research, Wei-Ting Hong performed the experiment, and Martinus Maslim and Wei-Ting Hong wrote the manuscript. Hei-Chia Wang reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors of this study declare no conflicts of interest.
About this article
Cite this article
Wang, HC., Maslim, M. & Hong, WT. Personalized time-sync comment generation based on a multimodal transformer. Multimedia Systems 30, 105 (2024). https://doi.org/10.1007/s00530-024-01301-3
DOI: https://doi.org/10.1007/s00530-024-01301-3