Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language

  • Authors:
  • Courtney Napoles; Mark Dredze

  • Affiliations:
  • Johns Hopkins University, Baltimore, MD; Johns Hopkins University, Baltimore, MD

  • Venue:
  • CL&W '10: Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids
  • Year:
  • 2010

Abstract

Text simplification is the process of changing vocabulary and grammatical structure to create a more accessible version of a text while maintaining its underlying information and content. Automated tools for text simplification are a practical way to make large corpora of text accessible to a wider audience lacking high levels of fluency in the corpus language. In this work, we investigate the potential of Simple Wikipedia to assist automatic text simplification by building a statistical classification system that discriminates simple English from ordinary English. Most text simplification systems are based on hand-written rules (e.g., PEST (Carroll et al., 1999) and its module SYSTAR (Canning et al., 2000)) and therefore face limitations in scaling and in transferring across domains. The potential for using Simple Wikipedia for text simplification is significant: it contains nearly 60,000 articles with revision histories and articles aligned to ordinary English Wikipedia. Using articles from Simple Wikipedia and ordinary Wikipedia, we evaluated different classifiers and feature sets to identify the most discriminative features of simple English for use across domains. These findings further our understanding of what makes text simple and can be applied as a tool to help writers craft simple text.
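
The abstract describes evaluating classifiers and feature sets over Simple and ordinary Wikipedia articles but does not enumerate them here. The following is a minimal sketch of one plausible instantiation, assuming logistic regression over three surface features (average sentence length, average word length, and type-token ratio); the feature choices, the model, and the toy example texts are illustrative assumptions, not the paper's actual setup or findings.

```python
# Sketch: discriminating simple English from ordinary English with a
# statistical classifier over surface features. The features and toy
# texts below are hypothetical, chosen only to illustrate the idea.

import re
from sklearn.linear_model import LogisticRegression

def surface_features(text):
    """Compute three simple surface statistics often used as readability cues."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return [0.0, 0.0, 0.0]
    avg_sent_len = len(words) / len(sentences)                   # words per sentence
    avg_word_len = sum(len(w) for w in words) / len(words)       # characters per word
    type_token = len({w.lower() for w in words}) / len(words)    # lexical diversity
    return [avg_sent_len, avg_word_len, type_token]

# Toy training data (hand-written, hypothetical stand-ins for
# Simple Wikipedia and ordinary Wikipedia article text).
simple_texts = [
    "The cat sat on the mat. It was warm. The cat slept.",
    "Dogs like to run. They play in the park. Running is fun.",
]
ordinary_texts = [
    "Notwithstanding the inclement weather, the feline reposed languidly upon the ornamental rug.",
    "Canine recreational behavior frequently manifests in expansive public greenspaces.",
]

X = [surface_features(t) for t in simple_texts + ordinary_texts]
y = [1] * len(simple_texts) + [0] * len(ordinary_texts)  # 1 = simple, 0 = ordinary

clf = LogisticRegression().fit(X, y)
print(clf.predict([surface_features("The sun is out. We go for a walk.")]))  # expect [1]
```

Surface statistics like these are standard readability cues; the paper's contribution concerns which features are actually most discriminative of simple English across domains, which this sketch does not attempt to reproduce.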