s-grams: Defining generalized n-grams for information retrieval

Authors:
Anni Järvelin;Antti Järvelin;Kalervo Järvelin
Affiliations:
University of Tampere, Department of Information Studies, FIN-33014 University of Tampere, Finland;University of Tampere, Department of Computer Sciences, FIN-33014 University of Tampere, Finland;University of Tampere, Department of Information Studies, FIN-33014 University of Tampere, Finland
Venue:
Information Processing and Management: an International Journal
Year:
2007

Abstract

n-grams have been used widely and successfully for approximate string matching in many areas. s-grams were recently introduced as an n-gram-based matching technique in which di-grams are formed from both adjacent and non-adjacent characters. s-grams have proved successful in approximate string matching across language boundaries in Information Retrieval (IR). However, s-grams lack a precise definition, as does their similarity comparison. In this paper, we give precise definitions for both. Our definitions are developed in a bottom-up manner, assuming only character strings and elementary mathematical concepts. Extending established practices, we provide novel definitions of s-gram profiles and of the L1 distance metric for them. This is a stronger string proximity measure than the popular Jaccard similarity measure, because Jaccard is insensitive to the counts of each n-gram in the strings to be compared. However, due to the popularity of Jaccard in IR experiments, we define the reduction of s-gram profiles to binary profiles in order to precisely define the (extended) Jaccard similarity function for s-grams. We also show that n-gram similarity/distance computations are special cases of our generalized definitions.
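
The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formal definition: the function names (`s_grams`, `profile`, `l1_distance`, `jaccard`), the absence of padding characters, and the pooling of several skip distances into a single profile are all assumptions made here for illustration. A skip of 0 gives ordinary adjacent di-grams, which is how the n-gram case appears as a special case.

```python
from collections import Counter


def s_grams(word, skip):
    """Di-grams whose two characters are separated by `skip` characters.

    skip=0 yields ordinary adjacent di-grams (assumed convention).
    """
    step = skip + 1
    return [word[i] + word[i + step] for i in range(len(word) - step)]


def profile(word, skips=(0,)):
    """Count how often each s-gram occurs, pooled over the given skips."""
    counts = Counter()
    for k in skips:
        counts.update(s_grams(word, k))
    return counts


def l1_distance(p, q):
    """Sum of absolute count differences over all s-grams in either profile."""
    return sum(abs(p[g] - q[g]) for g in p.keys() | q.keys())


def jaccard(p, q):
    """Jaccard similarity on binary profiles (presence or absence only)."""
    a, b = set(p), set(q)
    if not a and not b:
        return 1.0  # assumed convention for two empty profiles
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    p = profile("ababab")  # counts: ab x3, ba x2
    q = profile("abab")    # counts: ab x2, ba x1
    print(l1_distance(p, q))  # count-sensitive distance
    print(jaccard(p, q))      # binary profiles hide the count difference
```

In this example the two strings contain the same set of di-grams, so the binary Jaccard similarity treats them as identical, while the L1 distance on the count profiles still separates them. This is the count insensitivity the abstract refers to.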