n-grams have been used widely and successfully for approximate string matching in many areas. s-grams were introduced more recently as an n-gram based matching technique in which di-grams are formed from both adjacent and non-adjacent characters. s-grams have proved successful in approximate string matching across language boundaries in Information Retrieval (IR). However, neither s-grams nor their similarity comparison has been precisely defined. In this paper, we give precise definitions for both. Our definitions are developed in a bottom-up manner, assuming only character strings and elementary mathematical concepts. Extending established practices, we provide novel definitions of s-gram profiles and of the L1 distance metric for them. This is a stronger string proximity measure than the popular Jaccard similarity measure, because Jaccard is insensitive to the number of times each n-gram occurs in the strings being compared. However, because Jaccard is popular in IR experiments, we also define the reduction of s-gram profiles to binary profiles in order to precisely define the (extended) Jaccard similarity function for s-grams. We also show that n-gram similarity/distance computations are special cases of our generalized definitions.
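The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formal definitions, which the abstract does not reproduce. The function names (sgram_profile, l1_distance, jaccard), the skips parameter, the pooling of all skip distances into one profile, and the convention that two empty profiles have Jaccard similarity 1.0 are all assumptions made for illustration.

```python
from collections import Counter


def sgram_profile(s, skips=(0, 1)):
    """Count di-grams whose two characters have k characters skipped
    between them, for each k in `skips`.

    Assumption: all skip distances are pooled into a single profile.
    With skips=(0,) this reduces to an ordinary di-gram profile.
    """
    profile = Counter()
    for k in skips:
        # The second character sits k + 1 positions after the first.
        for i in range(len(s) - k - 1):
            profile[s[i] + s[i + k + 1]] += 1
    return profile


def l1_distance(p, q):
    """Sum of absolute count differences over every gram in either profile."""
    return sum(abs(p[g] - q[g]) for g in p.keys() | q.keys())


def jaccard(p, q):
    """Jaccard similarity of the binary (presence/absence) profiles."""
    a, b = set(p), set(q)
    if not a and not b:
        return 1.0  # assumed convention for two empty profiles
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical spelling variants, compared with skips 0 and 1.
    p, q = sgram_profile("colour"), sgram_profile("color")
    print(l1_distance(p, q), jaccard(p, q))

    # Count sensitivity: with adjacent di-grams only, "aaaa" and "aa"
    # contain the same single gram "aa" but with different counts.
    r, t = sgram_profile("aaaa", skips=(0,)), sgram_profile("aa", skips=(0,))
    print(l1_distance(r, t), jaccard(r, t))
```

In the second comparison the binary profiles are identical, so Jaccard reports a perfect match, while the L1 distance still registers the difference in counts. This is the insensitivity to counts that the abstract attributes to Jaccard.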