In this paper, we present a new approach for extracting high-quality (HQ) parallel corpora from multilingual resources. The originality of our research, compared to previous work, lies in its focus on obtaining HQ data for use in machine translation. Most previous approaches make it possible to acquire raw corpora quickly, but not to obtain HQ data. Our approach is a semi-automatic process consisting of a series of steps that automatically detect and download good multilingual websites and parallel web pages in order to construct parallel corpora whose quality is then validated, revised, and enhanced collaboratively.