Multimodal sn,k-grams: a skipping-based similarity model in information retrieval

Authors:
Pakinee Aimmanee;Thanaruk Theeramunkong
Affiliations:
Sirindhorn International Institute of Technology, Thammasat University, Patumthani, Thailand;Sirindhorn International Institute of Technology, Thammasat University, Patumthani, Thailand
Venue:
ACIIDS'10 Proceedings of the Second international conference on Intelligent information and database systems: Part I
Year:
2010

Abstract

A generalization of n-gram term modeling, the sn,k-gram, has recently been proposed; it allows k-term skipping in the n-gram representation. This paper presents a multimodal sn,k-gram similarity that combines multiple similarity vectors, each obtained by computing the similarity between queries and documents represented as s-grams with a particular n and k. Adjusting the weights in the combination yields an approximate matching model that can relate a relevant document to a query even when the document contains none of the exact terms in the query, or vice versa. To evaluate the proposed method, we analyzed two variants of the multimodal sn,k-gram model, equal weighting and performance-based weighting, over all queries on two collections of medical documents that are alike in context but written in different languages. The results show that the multimodal sn,k-gram similarity significantly outperforms conventional unigrams and bigrams.
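The idea of combining similarities from several (n, k) settings can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify the details, so the following are assumptions made only for illustration: an sn,k-gram is taken to be n terms with exactly k terms skipped between adjacent components, each per-mode similarity is a cosine over raw gram counts, the same (n, k) is applied to both query and document, and the combination is a normalized weighted sum. The names `skip_grams`, `cosine`, and `multimodal_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def skip_grams(terms, n, k):
    """Return n-term tuples whose adjacent components are k terms apart.

    Assumed definition: k = 0 gives ordinary n-grams, k = 1 skips one term
    between neighbours, and so on.
    """
    step = k + 1
    span = (n - 1) * step  # distance from first to last component
    return [tuple(terms[i + j * step] for j in range(n))
            for i in range(len(terms) - span)]


def cosine(a, b):
    """Cosine similarity between two Counters of gram counts (assumed measure)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def multimodal_similarity(query, doc, modes, weights):
    """Weighted sum of per-mode similarities; weights are normalized to sum to 1."""
    total = sum(weights)
    score = 0.0
    for (n, k), w in zip(modes, weights):
        q = Counter(skip_grams(query, n, k))
        d = Counter(skip_grams(doc, n, k))
        score += (w / total) * cosine(q, d)
    return score


# Equal weighting over unigrams, bigrams, and bigrams with one skipped term.
modes = [(1, 0), (2, 0), (2, 1)]
query = ["acute", "kidney", "failure"]
doc = ["acute", "renal", "failure"]
score = multimodal_similarity(query, doc, modes, [1, 1, 1])
```

In this toy example the ordinary bigrams of the query and document share nothing, while the one-skip bigram ("acute", "failure") appears in both, which is the kind of approximate match the skipping is meant to capture. A performance-based variant would, under the same assumptions, replace the equal weights with values derived from how well each mode performs on its own.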