Conceptual analysis of parallel corpus collected from the Web

Authors:
Kar Wing Li; Christopher C. Yang
Affiliations:
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong, People's Republic of China; Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong, People's Republic of China
Venue:
Journal of the American Society for Information Science and Technology
Year:
2006

Abstract

As illustrated by the World Wide Web, the volume of information in languages other than English has grown significantly in recent years, which highlights the importance of multilingual corpora. Much effort has been devoted to compiling multilingual corpora for cross-lingual information retrieval and machine translation. Existing parallel corpora mostly involve pairs of European languages, such as English–French and English–Spanish; parallel corpora between European and Asian languages are still lacking. In the authors' previous work, an alignment method that identifies one-to-one Chinese and English title pairs was developed to construct an English–Chinese parallel corpus automatically from the World Wide Web; it achieved 100% precision and 87% recall. Careful analysis of these results has helped the authors understand how the alignment method can be improved. A conceptual analysis was conducted, covering conceptual equivalence and conceptual information alteration in the aligned and nonaligned English–Chinese title pairs obtained by the alignment method. The result of the analysis not only reflects the characteristics of parallel corpora but also gives insight into the strengths and weaknesses of the alignment method. In particular, conceptual alteration, such as omission and addition, is found to have a significant impact on the performance of the alignment method. © 2006 Wiley Periodicals, Inc.
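
The precision and recall figures quoted in the abstract can be read as set overlaps between the title pairs an alignment method outputs and a manually verified set of true pairs. The sketch below shows only that arithmetic. The function name `precision_recall`, the pair identifiers, and the toy data are hypothetical and are not taken from the authors' corpus or method.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold  # pairs the aligner got right
    # Precision: share of predicted pairs that are true pairs.
    precision = len(correct) / len(predicted) if predicted else 0.0
    # Recall: share of true pairs that the aligner found.
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: (English title id, Chinese title id).
gold_pairs = {("e1", "c1"), ("e2", "c2"), ("e3", "c3"), ("e4", "c4")}
# The aligner misses one true pair but proposes no wrong ones.
predicted_pairs = {("e1", "c1"), ("e2", "c2"), ("e3", "c3")}

precision, recall = precision_recall(predicted_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every proposed pair is correct, so precision is perfect, while the missed pair lowers recall. That is the same pattern as the reported results: a method that only aligns pairs it is sure about, and leaves the rest nonaligned, trades recall for precision.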