Automatic acquisition of Chinese–English parallel corpus from the web

Authors:
Ying Zhang; Ke Wu; Jianfeng Gao; Phil Vines
Affiliations:
RMIT University, Melbourne, Australia; Shanghai Jiaotong University, Shanghai, China; Microsoft Research, Redmond, Washington; RMIT University, Melbourne, Australia
Venue:
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Year:
2006

Abstract

Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing. Previously, only small-scale corpora have been available, which has restricted their practical use. This paper describes a system that overcomes this limitation by automatically collecting high-quality parallel bilingual corpora from the web. Previous systems used a single principal feature for parallel web page verification, whereas we use multiple features to identify parallel texts via a k-nearest-neighbor classifier. Our system was evaluated on a data set of 6500 manually annotated Chinese–English candidate parallel pairs. Experiments show that a k-nearest-neighbor classifier with multiple features achieves substantial improvements over systems that use any one of these features alone. The system achieved a precision of 95% and a recall of 97%, a significant improvement over earlier work.
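
The general idea of verifying candidate page pairs with a k-nearest-neighbor classifier over several features, and then scoring the result by precision and recall, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the features, so the three used here (a length ratio, a tag-structure similarity, and an alignment score) are hypothetical, as are the toy training values, the function names, and the choice of k = 3 with Euclidean distance.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def precision_recall(predicted, actual):
    """Precision and recall for the positive class (True = parallel pair)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up annotated candidates. Each vector holds hypothetical features:
# (length_ratio, tag_similarity, alignment_score); True means "parallel".
TRAIN = [
    ((0.95, 0.90, 0.85), True),
    ((0.90, 0.85, 0.80), True),
    ((0.88, 0.92, 0.78), True),
    ((0.40, 0.30, 0.10), False),
    ((0.55, 0.20, 0.15), False),
    ((0.35, 0.45, 0.05), False),
]

if __name__ == "__main__":
    candidate = (0.92, 0.88, 0.80)
    print("parallel?", knn_predict(TRAIN, candidate))
```

Combining several features in one distance lets a candidate that looks borderline on any single feature still be classified by its overall similarity to annotated examples, which is the intuition behind using multiple features rather than one.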