Fast-Champollion: a fast and robust sentence alignment algorithm

  • Authors:
  • Peng Li;Maosong Sun;Ping Xue

  • Affiliations:
  • National Lab for Information Science and Technology;National Lab for Information Science and Technology;The Boeing Company

  • Venue:
  • COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Sentence-level aligned parallel texts are important resources for a number of natural language processing (NLP) tasks and applications such as statistical machine translation and cross-language information retrieval. With the rapid growth of online parallel texts, efficient and robust sentence alignment algorithms become increasingly important. In this paper, we propose a fast and robust sentence alignment algorithm, i.e., Fast-Champollion, which employs a combination of both length-based and lexicon-based algorithm. By optimizing the process of splitting the input bilingual texts into small fragments for alignment, Fast-Champollion, as our extensive experiments show, is 4.0 to 5.1 times as fast as the current baseline methods such as Champollion (Ma, 2006) on short texts and achieves about 39.4 times as fast on long texts, and Fast-Champollion is as robust as Champollion.