New approach for collecting high quality parallel corpora from multilingual websites

  • Authors:
  • Cong Phap Huynh

  • Affiliations:
  • Danang University of Technology, Vietnam

  • Venue:
  • Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services
  • Year:
  • 2011

Abstract

In this paper, we present a new approach for extracting high-quality (HQ) parallel corpora from multilingual resources. The originality of our research, compared with previous work, lies in its approach to obtaining HQ data for use in the machine translation domain. Most previous approaches allow raw corpora to be acquired quickly but do not yield HQ data. Our approach is a semi-automatic process consisting of a series of steps that automatically detect and download good multilingual websites and parallel web pages, in order to construct parallel corpora whose quality is then validated, revised, and enhanced collaboratively.
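
The abstract does not say how parallel web pages are detected. As an illustration only, the following minimal sketch shows one common heuristic for this kind of step: grouping URLs that differ only in a language-code path segment. The language codes, the function names `url_key` and `pair_candidates`, and the example URLs are all assumptions made for this sketch, not details taken from the paper.

```python
import re
from collections import defaultdict

# Hypothetical language markers; the abstract does not specify any.
LANG_CODES = ("en", "fr", "vi")

# Matches a language code that stands alone as a path segment, e.g. "/en/".
_LANG_SEGMENT = re.compile(r"/(%s)(?=/|$)" % "|".join(LANG_CODES))


def url_key(url):
    """Return (language, URL template) or None if no language segment is found."""
    match = _LANG_SEGMENT.search(url)
    if match is None:
        return None
    lang = match.group(1)
    # Replace the language segment with a placeholder so that translations
    # of the same page share one template.
    template = url[:match.start()] + "/{lang}" + url[match.end():]
    return lang, template


def pair_candidates(urls):
    """Group URLs that differ only in their language segment."""
    groups = defaultdict(dict)
    for url in urls:
        key = url_key(url)
        if key is None:
            continue
        lang, template = key
        groups[template][lang] = url
    # Keep only templates seen in at least two languages.
    return {t: langs for t, langs in groups.items() if len(langs) >= 2}


if __name__ == "__main__":
    example_urls = [
        "http://example.org/en/news/1.html",
        "http://example.org/vi/news/1.html",
        "http://example.org/en/about.html",
        "http://example.org/contact.html",
    ]
    for template, pages in pair_candidates(example_urls).items():
        print(template, pages)
```

A heuristic like this only produces candidate pairs. In a semi-automatic process such as the one described, those candidates would still need to go through the validation and revision steps before entering the corpus.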