Automatic construction of English/Chinese parallel corpora

Authors:
Christopher C. Yang; Kar Wing Li
Affiliations:
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong (both authors)
Venue:
Journal of the American Society for Information Science and Technology
Year:
2003

Abstract

As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for applications in cross-lingual information retrieval and natural language processing. Dictionaries are the most typical tools for crossing the boundaries that exist between different languages. However, a general-purpose dictionary is less sensitive to both genre and domain. It is also impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesauri for large applications. Corpus-based approaches, which do not have the limitations of dictionaries, provide a statistical translation model with which to cross the language boundary. Many domain-specific parallel or comparable corpora are employed in machine translation and cross-lingual information retrieval. Most of these are corpora between Indo-European languages, such as English/French and English/Spanish. Asian/Indo-European corpora, especially English/Chinese corpora, are relatively sparse. The objective of the present research is to construct an English/Chinese parallel corpus automatically from the World Wide Web. In this paper, an alignment method based on dynamic programming is presented to identify one-to-one Chinese and English title pairs. The method includes alignment at the title level, word level, and character level. The longest common subsequence (LCS) is applied to find the most reliable Chinese translation of an English word. Because one word in one language may be translated into two or more words repeatedly in another language, the edit operation of deletion is used to resolve redundancy. A score function is then proposed to determine the optimal title pairs. Experiments have been conducted to investigate the performance of the proposed method using the daily press release articles of the Hong Kong SAR government as the test bed. The precision of the result is 0.998, while the recall is 0.806. The release articles and speech articles published by the Hongkong & Shanghai Banking Corporation Limited are also used to test our method; on this set, the precision is 1.00 and the recall is 0.948.
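
The abstract names the longest common subsequence (LCS) as the building block for matching a candidate Chinese translation against a Chinese title at the character level. As a rough illustration only, the following is a minimal sketch of the textbook dynamic-programming LCS, not the authors' alignment method or score function. The function name `lcs` and the example strings are assumptions introduced here for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a list.

    Textbook dynamic programming; an illustrative sketch, not the
    paper's actual alignment procedure.
    """
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1  # skipping a[i-1] corresponds to a deletion
        else:
            j -= 1
    return out[::-1]


# Hypothetical character-level comparison of a dictionary translation
# with the wording found in a Chinese title.
candidate = "香港特別行政區"
title = "香港特區政府"
shared = lcs(candidate, title)
print(len(shared), "".join(shared))
```

Characters of the candidate that do not appear in the recovered subsequence are the ones a deletion operation would drop. How such deletions are weighted and combined into a score for title pairs is specific to the paper and is not reproduced here.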