An unsupervised alignment algorithm for text simplification corpus construction

Authors:
Stefan Bott; Horacio Saggion
Affiliations:
Universitat Pompeu Fabra, C/Tanger - Barcelona, Spain; Universitat Pompeu Fabra, C/Tanger - Barcelona, Spain
Venue:
MTTG '11 Proceedings of the Workshop on Monolingual Text-To-Text Generation
Year:
2011

Citing 8
Cited 4

Using hidden Markov modeling to decompose human-written summaries
Computational Linguistics - Summarization

An Architecture for a Text Simplification System
LEC '02 Proceedings of the Language Engineering Conference (LEC'02)

A program for aligning sentences in bilingual corpora
Computational Linguistics - Special issue on using large corpora: I

Motivations and methods for text simplification
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2

Sentence alignment for monolingual comparable corpora
EMNLP '03 Proceedings of the 2003 conference on Empirical methods in natural language processing

Towards Brazilian Portuguese automatic text simplification systems
Proceedings of the eighth ACM symposium on Document engineering

Cognitively motivated features for readability assessment
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics

A monolingual tree-based translation model for sentence simplification
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics

DysWebxia: a model to improve accessibility of the textual web for dyslexic users
ACM SIGACCESS Accessibility and Computing

Automatic simplification of Spanish text for e-accessibility
ICCHP'12 Proceedings of the 13th international conference on Computers Helping People with Special Needs - Volume Part I

Towards automatic lexical simplification in Spanish: an empirical study
PITR '12 Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

Text simplification resources for Spanish
Language Resources and Evaluation

Abstract

We present a method for the sentence-level alignment of short simplified texts to the original texts from which they were adapted. Our goal is to align a medium-sized parallel corpus consisting of short Spanish news texts and their simplified counterparts. No training data is available for this task, so we have to rely on unsupervised learning. In contrast to bilingual sentence alignment, in this task we can exploit the fact that the probability of sentence correspondence can be estimated from the lexical similarity between sentences. We show that the algorithm employed performs better than a baseline that approaches the problem with a TF*IDF sentence similarity metric. The alignment algorithm is being used to create a corpus for the study of text simplification in Spanish.
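
To make the TF*IDF baseline mentioned in the abstract more concrete, the following is a minimal sketch of how such a baseline might look. It is not the authors' implementation: the abstract does not specify tokenization, weighting, or decision rules, so the tokenizer, the smoothed IDF formula, the `threshold` parameter, the function names, and the example sentences are all assumptions made for illustration. Each simplified sentence is paired with the original sentence that has the highest TF*IDF cosine similarity.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on non-word characters (a simplifying assumption)."""
    return re.findall(r"\w+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse TF*IDF vector (dict: term -> weight) per sentence."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    # Smoothed IDF (assumed variant): terms found in every sentence keep a weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_by_tfidf(original, simplified, threshold=0.1):
    """Pair each simplified sentence with its most similar original sentence.

    Returns (simplified_index, original_index or None, score) triples.
    The threshold is a hypothetical cut-off for leaving a sentence unaligned.
    """
    vectors = tfidf_vectors(original + simplified)
    orig_vecs = vectors[: len(original)]
    simp_vecs = vectors[len(original):]
    alignment = []
    for j, simp_vec in enumerate(simp_vecs):
        scores = [cosine(orig_vec, simp_vec) for orig_vec in orig_vecs]
        best = max(range(len(scores)), key=scores.__getitem__) if scores else None
        if best is not None and scores[best] >= threshold:
            alignment.append((j, best, scores[best]))
        else:
            alignment.append((j, None, 0.0))
    return alignment


# Invented example sentences, not taken from the corpus described above.
original = [
    "El gobierno aprobó ayer una nueva ley de educación.",
    "La medida entrará en vigor el próximo año.",
    "Los sindicatos convocaron una huelga general.",
]
simplified = [
    "El gobierno aprobó una ley de educación.",
    "Los sindicatos convocaron una huelga.",
]

for simp_idx, orig_idx, score in align_by_tfidf(original, simplified):
    print(simp_idx, orig_idx, round(score, 3))
```

A baseline of this kind scores every sentence pair independently and ignores sentence order, which is one plausible reason an alignment algorithm that models the sequence of sentences could outperform it.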