A bootstrapping method for extracting bilingual text pairs

Authors:
Hiroshi Masuichi;Raymond Flournoy;Stefan Kaufmann;Stanley Peters
Affiliations:
Fuji Xerox Co., Ltd., Kanagawa, Japan;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Year:
2000

Citing 3
Cited 7

Information retrieval using a singular value decomposition model of latent semantic structure

SIGIR '88 Proceedings of the 11th annual international ACM SIGIR conference on Research and development in information retrieval
Querying across languages: a dictionary-based approach to multilingual information retrieval

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
A cooccurrence-based thesaurus and two applications to information retrieval

Information Processing and Management: an International Journal

Semi-automatic Compilation of Bilingual Lexicon Entries from Cross-Lingually Relevant News Articles on WWW News Sites

AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
Mining comparable bilingual text corpora for cross-language information integration

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Named entity transliteration with comparable corpora

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Integrating Cross-Language Hierarchies and Its Application to Retrieving Relevant Documents

ACM Transactions on Asian Language Information Processing (TALIP)
Cross-lingual latent topic extraction

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Parallel sentence generation from comparable corpora for improved SMT

Machine Translation

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper proposes a method for extracting bilingual text pairs from a comparable corpus. The basic idea of the method is to apply bootstrapping to an existing corpus-based cross-language information retrieval (CLIR) approach. We conducted preliminary tests with English and Japanese bilingual corpora. The bootstrapping method led to much better results for the task of extracting translation pairs compared with a corpus-based CLIR method without boot-strapping, and the extracted translation pairs could be useful training data for improving results of the corpus-based CLIR method.