Resource report: building parallel text corpora for multi-domain translation system

Authors:
Hammam Riza Budiono;Chairil Hakim
Affiliations:
Science and Technology Network Information Center (IPTEKnet), Jakarta, Indonesia;Science and Technology Network Information Center (IPTEKnet), Jakarta, Indonesia
Venue:
ALR7 Proceedings of the 7th Workshop on Asian Language Resources
Year:
2009

Abstract

Parallel text is one of the most valuable resources for the development of statistical machine translation systems and other NLP applications. However, manual translation is very costly, and the amount of available parallel text is limited. Hence, our research started with creating and collecting a large amount of parallel text resources for Indonesian-English. In this paper we describe the creation of three parallel corpora: ANTARA News, BPPT-PANL and BTEC-ATR. To be useful for statistical approaches to language processing, these resources must be available in reasonable quantity and quality. We describe the problems encountered and their solutions, as well as the robust tools and annotation schema used to build and process these corpora.