Is Simple English Wikipedia As Simple And Easy-to-Understand As We Expect It To Be?

Authors:
Sanja Stajner, ReadableAI, Germany
Sergiu Nisioi, ReadableAI, Germany
Daniel Ibanez, ReadableAI, Germany

DSAI '20: Proceedings of the 9th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion, December 2020, Pages 66–70
https://doi.org/10.1145/3439231.3439263

Published: 09 June 2021

ABSTRACT

The conceptual complexity of a written text plays an important role in maintaining the reader's interest in reading it. Therefore, automatic text simplification systems should consider not only the lexical and syntactic complexity of a text but also its conceptual complexity. In this study, we analyze and compare two widely used English text simplification corpora, one professionally produced (Newsela) and the other collaboratively made by amateurs and enthusiasts (English Wikipedia–Simple English Wikipedia), focusing on 19 conceptual complexity features. The results indicate that the simplification operations made during the production of Simple English Wikipedia in many cases do not follow the patterns of the professionally simplified corpus, thus casting doubt on the adequacy of using Simple English Wikipedia as training material for automatic text simplification systems.

Published in

DSAI '20: Proceedings of the 9th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion
December 2020
245 pages
ISBN:9781450389372
DOI:10.1145/3439231

Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Newsela
Simple English Wikipedia
conceptual complexity
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 17 of 23 submissions, 74%