Automatic acquisition of chinese–english parallel corpus from the web

  • Authors:
  • Ying Zhang;Ke Wu;Jianfeng Gao;Phil Vines

  • Affiliations:
  • RMIT University, Melbourne, Australia;Shanghai Jiaotong University, Shanghai, China;Microsoft Research, Redmond, Washington;RMIT University, Melbourne, Australia

  • Venue:
  • ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing systems. Previously only small scale corpora have been available, thus restricting their practical use. This paper describes a system that overcomes this limitation by automatically collecting high quality parallel bilingual corpora from the web. Previous systems used a single principle feature for parallel web page verification, whereas we use multiple features to identify parallel texts via a k-nearest-neighbor classifier. Our system was evaluated using a data set containing 6500 Chinese–English candidate parallel pairs that have been manually annotated. Experiments show that the use of a k-nearest-neighbors classifier with multiple features achieves substantial improvements over the systems that use any one of these features. The system achieved a precision rate of 95% and a recall rate of 97%, and thus is a significant improvement over earlier work.