Automatic generation of Japanese–English bilingual thesauri based on bilingual corpora

Authors:
Keita Tsuji;Kyo Kageura
Affiliations:
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan;National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Venue:
Journal of the American Society for Information Science and Technology - Research Articles
Year:
2006

Abstract

The authors propose a method for automatically generating Japanese–English bilingual thesauri based on bilingual corpora. The term "bilingual thesaurus" refers to a set of bilingual equivalent words and their synonyms. Most of the methods proposed so far for extracting bilingual equivalent word clusters from bilingual corpora depend heavily on word frequency and are not effective for dealing with low-frequency clusters. These low-frequency bilingual clusters are worth extracting because they contain many newly coined terms that are in demand but are not listed in existing bilingual thesauri. Assuming that single language-pair-independent methods such as frequency-based ones have reached their limitations and that a language-pair-dependent method used in combination with other methods shows promise, the authors propose the following approach: (a) extract translation pairs based on transliteration patterns; (b) remove the pairs from among the candidate words; (c) extract translation pairs based on word frequency from the remaining candidate words; and (d) generate bilingual clusters based on the extracted pairs using a graph-theoretic method. The proposed method has been found to be significantly more effective than other methods. © 2006 Wiley Periodicals, Inc.
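
The control flow of steps (a) through (d) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the transliteration patterns, the frequency measure, or which graph-theoretic method is used, so the two extractors are passed in as hypothetical callables (`by_transliteration`, `by_frequency`), and plain connected components stand in for the clustering step as an assumption. The function names `cluster_pairs` and `generate_clusters` are likewise made up for illustration.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group (japanese, english) pairs into clusters.

    Assumption: clusters are connected components of the graph whose
    edges are the extracted translation pairs. The abstract only says
    "a graph-theoretic method", so this is a stand-in.
    """
    graph = defaultdict(set)
    for ja, en in pairs:
        # Tag nodes with their language so equal strings stay distinct.
        graph[("ja", ja)].add(("en", en))
        graph[("en", en)].add(("ja", ja))

    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(component)
    return clusters


def generate_clusters(ja_words, en_words, by_transliteration, by_frequency):
    """Run steps (a)-(d) with caller-supplied (hypothetical) extractors."""
    first = list(by_transliteration(ja_words, en_words))    # step (a)
    used_ja = {ja for ja, _ in first}
    used_en = {en for _, en in first}
    rest_ja = [w for w in ja_words if w not in used_ja]     # step (b)
    rest_en = [w for w in en_words if w not in used_en]
    second = list(by_frequency(rest_ja, rest_en))           # step (c)
    return cluster_pairs(first + second)                    # step (d)
```

In this sketch, pairs that share a word on either side fall into the same cluster, which is one simple way for synonyms to group around a common translation. Whether the paper's method splits or filters such components further is not stated in the abstract.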