Longest sorted sequence algorithm for parallel text alignment

  • Authors:
  • Tiago Ildefonso;Gabriel Pereira Lopes

  • Affiliations:
  • Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Centre of Informatics and Information Technologies (CITI), Quinta da Torre, Caparica, Portugal;Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Centre of Informatics and Information Technologies (CITI), Quinta da Torre, Caparica, Portugal

  • Venue:
  • EUROCAST'05 Proceedings of the 10th international conference on Computer Aided Systems Theory
  • Year:
  • 2005

Abstract

This paper describes a statistically supported, language-independent method for aligning parallel texts (texts that are translations of each other, or of a common source text). The approach is inspired by previous work by Ribeiro et al. (2000). The second statistical filter proposed by Ribeiro et al., which is based on Confidence Bands (CB), is replaced by the Longest Sorted Sequence Algorithm (LSSA), which is described in this paper. As a result, a 35% decrease in processing time and an 18% increase in the number of aligned segments were obtained for Portuguese-French alignments. Similar results were obtained for Portuguese-English alignments. Both methods are compared and evaluated on a large parallel corpus of Portuguese, English and French parallel texts (approximately 250Mb of text per language).
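
The abstract does not give the details of LSSA, so the following is only a minimal sketch of the general idea a "longest sorted sequence" filter suggests, not the authors' implementation. It assumes that candidate correspondence points are (source position, target position) pairs and that the filter keeps the longest subset whose positions increase in both texts, discarding points that would cross the alignment. The function name `longest_sorted_sequence`, the point representation, and the sample data are all hypothetical.

```python
from bisect import bisect_left


def longest_sorted_sequence(points):
    """Return the longest chain of (src_pos, tgt_pos) points that is
    strictly increasing in both coordinates (hypothetical LSSA-style filter)."""
    # Sort by source position; ties are ordered by descending target position
    # so that two points sharing a source position cannot both be kept.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))

    tails = []      # tails[k]: smallest target position ending a chain of length k + 1
    tails_idx = []  # index in `ordered` of the point stored in tails[k]
    prev = [None] * len(ordered)  # back-pointers used to rebuild the chain

    for i, (_, tgt) in enumerate(ordered):
        k = bisect_left(tails, tgt)  # first chain whose tail is >= tgt
        if k == len(tails):
            tails.append(tgt)
            tails_idx.append(i)
        else:
            tails[k] = tgt
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else None

    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    chain.reverse()
    return chain


# Made-up candidate points: (offset in source text, offset in target text).
candidates = [(10, 12), (20, 95), (30, 33), (40, 41), (50, 20), (60, 58)]
aligned = longest_sorted_sequence(candidates)
```

With these made-up candidates, the points (20, 95) and (50, 20) break the ordering and are dropped, leaving four points. The binary search keeps the cost at O(n log n) in the number of candidates, which is one plausible reason such a filter could be cheaper than a statistical one, although the abstract itself does not state where the reported speed-up comes from.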