Resource report: building parallel text corpora for multi-domain translation system

  • Authors:
  • Hammam Riza Budiono;Chairil Hakim

  • Affiliations:
  • Science and Technology Network Information Center (IPTEKnet), Jakarta, Indonesia;Science and Technology Network Information Center (IPTEKnet), Jakarta, Indonesia

  • Venue:
  • ALR7 Proceedings of the 7th Workshop on Asian Language Resources
  • Year:
  • 2009

Abstract

Parallel text is one of the most valuable resources for developing statistical machine translation systems and other NLP applications. However, manual translation is very costly, and the amount of available parallel text is limited. Hence, our research started with creating and collecting a large amount of parallel text resources for Indonesian-English. In this paper we describe the creation of three parallel corpora: ANTARA News, BPPT-PANL and BTEC-ATR. To be useful for statistical approaches to language processing, these resources must be available in reasonable quantity and quality. We describe the problems and their solutions, as well as the robust tools and annotation schema used to build and process these corpora.