In many applications, it is necessary to algorithmically quantify the similarity between two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
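The relationship between the unigram measures and an n-gram generalization can be illustrated with a minimal sketch. It is not the definition given in the paper: it assumes a binary n-gram cost (two n-grams either match exactly or not at all), it applies no padding or affixing to the strings, and the names `ngrams`, `ngram_distance`, and `ngram_similarity` are hypothetical. Under these assumptions, setting n = 1 yields the standard edit distance and the length of the longest common subsequence.

```python
def ngrams(s, n):
    """Return the list of contiguous n-grams of s (empty if len(s) < n)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_distance(x, y, n=1):
    """Edit-distance-style dynamic program over n-gram sequences.

    Assumption: substituting one n-gram for another costs 0 if they are
    identical and 1 otherwise; insertions and deletions cost 1.
    With n=1 this is the ordinary (Levenshtein) edit distance.
    """
    a, b = ngrams(x, n), ngrams(y, n)
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ga in enumerate(a, 1):
        cur = [i]
        for j, gb in enumerate(b, 1):
            cost = 0 if ga == gb else 1
            cur.append(min(prev[j] + 1,          # delete ga
                           cur[j - 1] + 1,       # insert gb
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]


def ngram_similarity(x, y, n=1):
    """LCS-style dynamic program over n-gram sequences.

    Assumption: a pair of n-grams contributes 1 if identical, else 0.
    With n=1 this is the length of the longest common subsequence.
    """
    a, b = ngrams(x, n), ngrams(y, n)
    prev = [0] * (len(b) + 1)
    for ga in a:
        cur = [0]
        for j, gb in enumerate(b, 1):
            match = 1 if ga == gb else 0
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + match))
        prev = cur
    return prev[-1]
```

Both functions run in time proportional to the product of the two string lengths and keep only one previous row of the table. A graded n-gram cost or string padding would change the values for n > 1 but not the overall structure of the recurrence.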