Low-cost, high-performance translation retrieval: dumber is better

Authors: Timothy Baldwin
Affiliations: Tokyo Institute of Technology, Meguro-ku, Tokyo, Japan
Venue: ACL '01 Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
Year: 2001

Abstract

In this paper, we compare the relative effects of segment order, segmentation and segment contiguity on the retrieval performance of a translation memory system. We take a selection of both bag-of-words and segment order-sensitive string comparison methods, and run each over both character and word-segmented data, in combination with a range of local segment contiguity models (in the form of N-grams). Over two distinct datasets, we find that indexing according to simple character bigrams produces a retrieval accuracy superior to any of the tested word N-gram models. Further, in their optimum configuration, bag-of-words methods are shown to be equivalent to segment order-sensitive methods in terms of retrieval accuracy, but much faster. We also provide evidence that our findings are scalable.
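
As a rough illustration of the kind of method the abstract favours, the following minimal sketch indexes strings by character bigrams and ranks translation memory entries with a bag-of-words style comparison. The abstract does not name the specific similarity measures tested, so the Dice coefficient used here is an assumption, and the function names (`char_ngrams`, `dice_similarity`, `retrieve`) and the sample memory entries are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return the multiset of overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient over bags of character n-grams.

    Assumed measure for illustration only. It ignores where each
    n-gram occurs, so segment order matters only within an n-gram.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total


def retrieve(query, memory, n=2):
    """Return the (source, translation) pair whose source best matches query."""
    return max(memory, key=lambda pair: dice_similarity(query, pair[0], n))


# Hypothetical translation memory of (source, translation) pairs.
memory = [
    ("open the file menu", "translation A"),
    ("close the window", "translation B"),
]
best = retrieve("open the edit menu", memory)
print(best)
```

Because the comparison works directly on characters, no word segmentation step is needed before indexing, which is one reason such a scheme is cheap to apply to unsegmented languages like Japanese. An order-sensitive alternative such as edit distance would replace `dice_similarity` while leaving `retrieve` unchanged.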