
Personalized time-sync comment generation based on a multimodal transformer

  • Special Issue Paper
  • Published in: Multimedia Systems

Abstract

Online video entertainment attracts large audiences and sustained viewing across many fields; with more than 4.5 billion Internet users worldwide, online video remains among the most popular online activities. Time-sync comments (TSCs) are a new type of textual information in videos. Unlike traditional online video-sharing platforms, where users can only leave comments in a separate comments section, TSCs "fly across" the screen at the playback time to which each comment refers. However, current research on TSC generation focuses only on the relationship between the image and TSC modalities and does not address personalization. We therefore propose personalized time-sync comment generation (PTSCG), a multimodal transformer that generates personalized TSCs better suited to individual users. In our experiments, the F1 score of PTSCG, obtained by comparing the generated TSCs with the original TSCs, reached 0.58, outperforming existing models and demonstrating the effectiveness of the proposed method.
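The abstract describes the PTSCG architecture only at a high level. As a rough illustration of how a personalized multimodal transformer of this kind could be wired together, here is a minimal PyTorch sketch. It is not the authors' implementation: the 2048-dimensional frame features, the fusion by sequence concatenation, the single user-embedding token, and the name PersonalizedCommentGenerator are all illustrative assumptions inferred from the abstract alone.

```python
# Hypothetical sketch of a personalized multimodal comment generator.
# NOT the authors' PTSCG code; dimensions and fusion strategy are
# assumptions. Positional encodings and target masking are omitted
# for brevity.
import torch
import torch.nn as nn


class PersonalizedCommentGenerator(nn.Module):
    def __init__(self, vocab_size, n_users, d_model=512):
        super().__init__()
        # Project pre-extracted frame features (e.g., 2048-d CNN
        # outputs) into the shared transformer dimension.
        self.frame_proj = nn.Linear(2048, d_model)
        # Embeddings for surrounding TSC tokens and for the target
        # user; the user embedding carries the personalization signal.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.user_emb = nn.Embedding(n_users, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, context_tokens, user_ids, tgt_tokens):
        # Encoder input: frames, surrounding comments, and one user
        # token, concatenated along the sequence axis.
        frames = self.frame_proj(frame_feats)          # (B, F, d)
        context = self.token_emb(context_tokens)       # (B, C, d)
        user = self.user_emb(user_ids).unsqueeze(1)    # (B, 1, d)
        src = torch.cat([frames, context, user], dim=1)
        tgt = self.token_emb(tgt_tokens)
        out = self.transformer(src, tgt)               # (B, T, d)
        return self.lm_head(out)                       # token logits


# Toy forward pass with random inputs.
model = PersonalizedCommentGenerator(vocab_size=8000, n_users=100)
logits = model(
    torch.randn(2, 8, 2048),          # 8 frames per clip
    torch.randint(0, 8000, (2, 20)),  # 20 context-comment tokens
    torch.randint(0, 100, (2,)),      # one user id per sample
    torch.randint(0, 8000, (2, 12)),  # 12 target-comment tokens
)
print(logits.shape)  # torch.Size([2, 12, 8000])
```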
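The reported F1 of 0.58 compares each generated TSC against the original TSCs at the same point in the video. This preview does not specify the matching procedure, so the sketch below assumes a simple bag-of-tokens precision and recall against the union of reference comments; token_f1 is a hypothetical helper, not the paper's metric code.

```python
# Hedged sketch of token-overlap F1 between a generated comment and
# the reference TSCs; the paper's actual procedure may differ.
from collections import Counter


def token_f1(generated, references):
    gen = Counter(generated.split())
    ref = Counter(" ".join(references).split())
    overlap = sum((gen & ref).values())  # shared tokens, min counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(token_f1("what a great save", ["great save by the keeper", "wow"]))
# 0.4: precision = 2/4, recall = 2/6
```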




Data availability

Data sharing does not apply to this article, as no datasets were generated or analyzed during the current study.


Funding

This research is based on work supported by the Taiwan Ministry of Science and Technology under Grant Nos. MOST 107-2410-H-006-040-MY3 and MOST 108-2511-H-006-009. The authors also acknowledge partial research support from the "Higher Education SPROUT Project" and the "Center for Innovative FinTech Business Models" of National Cheng Kung University (NCKU), sponsored by the Ministry of Education, Taiwan.

Author information


Contributions

Hei-Chia Wang conceived the research, Wei-Ting Hong performed the experiments, and Martinus Maslim and Wei-Ting Hong wrote the manuscript. Hei-Chia Wang reviewed the manuscript.

Corresponding author

Correspondence to Hei-Chia Wang.

Ethics declarations

Conflict of interest

The authors of this study declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, HC., Maslim, M. & Hong, WT. Personalized time-sync comment generation based on a multimodal transformer. Multimedia Systems 30, 105 (2024). https://doi.org/10.1007/s00530-024-01301-3

