Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Authors:
Anni Järvelin;Antti Järvelin
Affiliations:
Department of Information Studies, University of Tampere, Finland FIN-33014;Department of Computer Sciences, University of Tampere, Finland FIN-33014
Venue:
SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Year:
2008

Abstract

Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, such as edit distance or n-grams. The Jaccard coefficient has traditionally been used as the s-gram based string proximity measure, but other proximity measures for s-gram matching have not been tested. The current study evaluated the performance of seven proximity measures for classified s-grams in a CLIR context using eleven language pairs. The binary proximity measures generally performed better than their non-binary counterparts, but the difference depended mainly on the padding used with s-grams. When no padding was used, the binary and non-binary proximity measures performed nearly equally, though overall performance deteriorated.
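
The abstract does not define its terms, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that an s-gram is a character pair with a fixed number of skipped characters in between, that "classified" means skip distances are grouped into classes and only s-grams of the same class are compared, and that a non-binary measure counts repeated s-grams while a binary one does not. The function names, the class grouping `((0,), (1, 2))`, and the padding character `_` are all hypothetical choices made for the example.

```python
from collections import Counter


def s_grams(word, skip, pad=""):
    """Return character pairs with `skip` characters between them.

    `pad` is an optional string added to both ends of the word
    (an assumed padding scheme; an empty string means no padding).
    """
    padded = pad + word + pad
    step = skip + 1
    return [padded[i] + padded[i + step] for i in range(len(padded) - step)]


def classified_s_grams(word, classes=((0,), (1, 2)), pad=""):
    """Tag each s-gram with the index of its skip-distance class.

    The default grouping is an assumption for illustration only.
    """
    grams = []
    for class_id, skips in enumerate(classes):
        for skip in skips:
            grams.extend((class_id, g) for g in s_grams(word, skip, pad))
    return grams


def jaccard_binary(a, b):
    """Jaccard coefficient on sets: each distinct s-gram counts once."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def jaccard_multiset(a, b):
    """Count-based variant (assumed stand-in for a non-binary measure)."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # per-gram minimum of counts
    union = sum((ca | cb).values())  # per-gram maximum of counts
    return inter / union if union else 0.0


if __name__ == "__main__":
    source, candidate = "pharmacology", "farmakologia"
    for pad in ("", "_"):
        a = classified_s_grams(source, pad=pad)
        b = classified_s_grams(candidate, pad=pad)
        print(repr(pad), jaccard_binary(a, b), jaccard_multiset(a, b))
```

Running the script prints both scores with and without padding for one cross-lingual spelling variant pair, which makes it easy to see how padding and the binary versus count-based choice each change the proximity value.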