An unsupervised alignment algorithm for text simplification corpus construction

  • Authors:
  • Stefan Bott; Horacio Saggion

  • Affiliations:
  • Universitat Pompeu Fabra, C/Tanger - Barcelona, Spain; Universitat Pompeu Fabra, C/Tanger - Barcelona, Spain

  • Venue:
  • MTTG '11 Proceedings of the Workshop on Monolingual Text-To-Text Generation
  • Year:
  • 2011

Abstract

We present a method for the sentence-level alignment of short simplified texts to the original texts from which they were adapted. Our goal is to align a medium-sized parallel corpus consisting of short news texts in Spanish and their simplified counterparts. No training data is available for this task, so we have to rely on unsupervised learning. In contrast to bilingual sentence alignment, in this task we can exploit the fact that the probability of sentence correspondence can be estimated from the lexical similarity between sentences. We show that the algorithm employed performs better than a baseline which approaches the problem with a TF*IDF sentence similarity metric. The alignment algorithm is being used to create a corpus for the study of text simplification in Spanish.
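
To make the TF*IDF baseline mentioned in the abstract concrete, the following is a minimal sketch of one way such a baseline could work. It is not the authors' implementation. The tokenizer, the smoothed IDF formula, the use of cosine similarity, the rule of linking each simplified sentence to its single most similar original sentence, and all function names (tokenize, tfidf_vectors, cosine, align) are assumptions made for illustration only.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(sentence):
    # Hypothetical tokenizer: split on whitespace, strip punctuation, lowercase.
    tokens = (w.strip(PUNCT).lower() for w in sentence.split())
    return [t for t in tokens if t]

def tfidf_vectors(sentences):
    # Treat each sentence as a "document" and build sparse TF*IDF vectors.
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(counts)
    df = Counter()
    for c in counts:
        df.update(c.keys())
    # Smoothed IDF (an assumption; other weightings are possible).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def align(original, simplified):
    # Link each simplified sentence to its most similar original sentence.
    # Returns (simplified_index, original_index, score) triples.
    if not original:
        return []
    vectors = tfidf_vectors(original + simplified)
    orig_vecs = vectors[:len(original)]
    simp_vecs = vectors[len(original):]
    result = []
    for j, sv in enumerate(simp_vecs):
        scores = [cosine(sv, ov) for ov in orig_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        result.append((j, best, scores[best]))
    return result

if __name__ == "__main__":
    # Invented example sentences, not taken from the corpus described above.
    original = [
        "El gobierno aprobó ayer una nueva ley de educación.",
        "La medida afectará a miles de estudiantes universitarios.",
    ]
    simplified = [
        "El gobierno aprobó una ley nueva.",
        "Miles de estudiantes notarán la medida.",
    ]
    for j, i, score in align(original, simplified):
        print(f"simplified {j} -> original {i} (similarity {score:.2f})")
```

A baseline of this kind scores every sentence pair independently, so it ignores sentence order and cannot link one simplified sentence to several original sentences.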