New approach for collecting high quality parallel corpora from multilingual websites

Authors:
Cong Phap Huynh
Affiliations:
Danang University of Technology, Vietnam
Venue:
Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services
Year:
2011

Abstract

In this paper, we present a new approach for extracting high-quality (HQ) parallel corpora from multilingual resources. The originality of our research, compared with previous work, lies in its focus on obtaining HQ data for use in machine translation. Most previous approaches make it possible to acquire raw corpora quickly but do not yield HQ data. Our approach is a semi-automatic process consisting of a series of steps that automatically detect and download good multilingual websites and parallel web pages, and use them to construct parallel corpora whose quality is then validated, revised, and enhanced collaboratively.
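
The abstract does not say how parallel web pages are detected. As an illustration only, the sketch below uses one common heuristic, which is an assumption here and not necessarily the authors' method: URLs that differ only in a language path segment (such as `/en/` versus `/fr/`) are treated as candidate translations of each other. The function name `candidate_parallel_pages`, the `LANG_CODES` tuple, and the example URLs are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical set of language codes; adjust it to the sites being crawled.
LANG_CODES = ("en", "fr", "vi")

# Matches a language code that forms a whole path segment, e.g. "/en/".
_LANG_SEGMENT = re.compile(r"/(%s)(?=/|$)" % "|".join(LANG_CODES))


def candidate_parallel_pages(urls):
    """Group URLs that differ only in a language path segment.

    Returns a dict mapping a URL template (with "{lang}" in place of
    the language code) to a dict of {language code: original URL}.
    Only templates seen in at least two languages are kept.
    """
    groups = defaultdict(dict)
    for url in urls:
        match = _LANG_SEGMENT.search(url)
        if match is None:
            continue  # no language marker, so no candidate pairing
        # Replace the first language segment with a placeholder.
        key = url[:match.start()] + "/{lang}" + url[match.end():]
        groups[key][match.group(1)] = url
    return {key: pages for key, pages in groups.items() if len(pages) >= 2}


if __name__ == "__main__":
    crawled = [
        "http://example.org/en/news/1.html",
        "http://example.org/fr/news/1.html",
        "http://example.org/en/about.html",
        "http://example.org/contact.html",
    ]
    for template, pages in candidate_parallel_pages(crawled).items():
        print(template, sorted(pages))
```

A heuristic of this kind only proposes candidates. In a semi-automatic process such as the one described, the resulting pairs would still need to be validated and revised before being added to the corpus.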