ABSTRACT
This study addresses automatic summarizations generated using modern artificial intelligence techniques. Several mathematical methods exist for evaluating the performance of automatic summarization. Such methods are commonly used because they allow many test cases to be assessed with little human effort, whereas manual assessments are challenging and time-consuming. One open question is whether the output of such measures matches human perception of summarization quality. We report a human evaluation of automatic summarizations of 22 academic texts. The unique aspect of this study is that the participants were strongly familiar with the texts, having studied them in depth. The results are quite varied and do not indicate unanimous agreement that automatic summarizations are of high quality and are trusted.
Index Terms
- Human Experts’ Perceptions of Auto-Generated Summarization Quality