ABSTRACT
The conceptual complexity of a written text plays an important role in maintaining the reader's interest in it. Therefore, automatic text simplification systems should consider not only the lexical and syntactic complexity of a text but also its conceptual complexity. In this study, we analyze and compare two widely used English text simplification corpora, one professionally produced (Newsela) and the other collaboratively made by amateurs and enthusiasts (English Wikipedia–Simple English Wikipedia), focusing on 19 conceptual complexity features. The results indicate that the simplification operations made during the production of Simple English Wikipedia in many cases do not follow the patterns of the professionally simplified corpora, thus casting doubt on the adequacy of using Simple English Wikipedia as training material for automatic text simplification systems.