Combining sentence length with location information to align monolingual parallel texts

  • Authors:
  • Weigang Li; Ting Liu; Sheng Li

  • Affiliations:
  • Harbin Institute of Technology, Information Retrieval Laboratory, School of Computer Science and Technology, Harbin, P.R. China (all three authors)

  • Venue:
  • AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
  • Year:
  • 2004

Abstract

Abundant Chinese paraphrase resources can be obtained on the Internet from different Chinese translations of the same foreign masterpiece. A paraphrase corpus is a corpus of sentence pairs that convey the same information. The irregular characteristics of real monolingual parallel texts, especially the lack of strictly aligned paragraph boundaries between two translations, pose a challenge to alignment technology, and traditional alignment methods designed for bilingual texts have difficulty handling them. This paper describes a new method for aligning real monolingual parallel texts using the length and location information of sentence pairs. The model is motivated by the observation that the locations of sentence pairs of a given length are distributed similarly across the whole text. To date, a paraphrase corpus of about fifty thousand sentence pairs has been constructed.
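The abstract does not give the model's formulas, so the following is only a minimal illustrative sketch of how sentence length and relative location might be combined in a dynamic-programming aligner. It is not the authors' model. The names `pair_cost` and `align`, the weights `w_len` and `w_loc`, the `skip_cost` penalty, and the set of allowed alignment types (1-1, 1-0, 0-1, 2-1, 1-2) are all assumptions made for illustration.

```python
from typing import List, Tuple

def pair_cost(len_a: int, len_b: int, pos_a: float, pos_b: float,
              w_len: float = 1.0, w_loc: float = 1.0) -> float:
    """Hypothetical cost (lower is better): length mismatch plus location mismatch."""
    length_term = abs(len_a - len_b) / max(len_a, len_b, 1)
    location_term = abs(pos_a - pos_b)  # positions are fractions of the text in [0, 1]
    return w_len * length_term + w_loc * location_term

def align(text_a: List[str], text_b: List[str],
          skip_cost: float = 1.0) -> List[Tuple[List[int], List[int]]]:
    """Align two sentence lists; returns (indices in A, indices in B) groups."""
    n, m = len(text_a), len(text_b)
    # Cumulative character offsets: start[i] = characters before sentence i.
    start_a, start_b = [0], [0]
    for s in text_a:
        start_a.append(start_a[-1] + len(s))
    for s in text_b:
        start_b.append(start_b[-1] + len(s))
    total_a, total_b = start_a[-1] or 1, start_b[-1] or 1

    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]  # assumed alignment types

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = skip_cost  # sentence left unaligned
                else:
                    la = start_a[ni] - start_a[i]
                    lb = start_b[nj] - start_b[j]
                    # Relative location of each span's midpoint in its own text.
                    pa = (start_a[i] + la / 2) / total_a
                    pb = (start_b[j] + lb / 2) / total_b
                    step = pair_cost(la, lb, pa, pb)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Trace back from the end of both texts to recover the alignment.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups
```

Because location is measured as a fraction of each whole text and not relative to paragraph boundaries, a sketch like this does not require the two translations to share paragraph breaks. Lengths are counted in characters, which is one plausible choice for Chinese text.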