N-gram similarity and distance

Authors:
Grzegorz Kondrak
Affiliations:
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
Venue:
SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
Year:
2005


Abstract

In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
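
The special-case relationship stated in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formal definitions, so everything below is an assumption made for illustration: the function names (`ngrams`, `positional_match`, `ngram_similarity`, `ngram_distance`) are hypothetical, and scoring a pair of n-grams by the fraction of positions at which they agree is only one plausible choice, not necessarily the paper's. The sketch runs the usual dynamic programs for the longest common subsequence and for edit distance over n-grams instead of single symbols, so that with n = 1 they reduce to the two classical measures.

```python
def ngrams(s, n):
    """Return the list of contiguous n-grams of s (empty if len(s) < n)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def positional_match(a, b):
    """Fraction of positions at which two equal-length n-grams agree.

    Assumption: this per-n-gram score is chosen for illustration only.
    """
    return sum(x == y for x, y in zip(a, b)) / len(a)


def ngram_similarity(x, y, n=1):
    """LCS-style dynamic program over n-grams; n=1 gives the LCS length."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    table = [[0.0] * (len(gy) + 1) for _ in range(len(gx) + 1)]
    for i in range(1, len(gx) + 1):
        for j in range(1, len(gy) + 1):
            table[i][j] = max(
                table[i - 1][j],          # skip an n-gram of x
                table[i][j - 1],          # skip an n-gram of y
                table[i - 1][j - 1] + positional_match(gx[i - 1], gy[j - 1]),
            )
    return table[-1][-1]


def ngram_distance(x, y, n=1):
    """Edit-distance-style dynamic program over n-grams; n=1 gives Levenshtein distance."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    table = [[0.0] * (len(gy) + 1) for _ in range(len(gx) + 1)]
    for i in range(len(gx) + 1):
        table[i][0] = float(i)            # delete all n-grams of x so far
    for j in range(len(gy) + 1):
        table[0][j] = float(j)            # insert all n-grams of y so far
    for i in range(1, len(gx) + 1):
        for j in range(1, len(gy) + 1):
            cost = 1.0 - positional_match(gx[i - 1], gy[j - 1])
            table[i][j] = min(
                table[i - 1][j] + 1.0,    # deletion
                table[i][j - 1] + 1.0,    # insertion
                table[i - 1][j - 1] + cost,  # (partial) substitution
            )
    return table[-1][-1]
```

Under these assumptions, `ngram_similarity("kitten", "sitting", 1)` equals the length of the longest common subsequence of the two words (4), and `ngram_distance("kitten", "sitting", 1)` equals their edit distance (3). Larger values of n compare overlapping windows of symbols rather than single symbols.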