The Noisy Substring Matching Problem

Authors:
R. L. Kashyap;B. J. Oommen
Affiliations:
School of Electrical Engineering, Purdue University;-
Venue:
IEEE Transactions on Software Engineering
Year:
1983

Abstract

Let T(U) be the set of words in the dictionary H that contain U as a substring. The problem considered here is the estimation of the set T(U) when U is not known but Y, a noisy version of U, is available. The suggested set estimate S*(Y) of T(U) is a proper subset of H in which every element contains at least one substring that most closely resembles Y according to the Levenshtein metric. The proposed algorithm for the computation of S*(Y) requires cubic time. The algorithm uses the recursively computable dissimilarity measure Dk(X, Y), termed the kth distance between two strings X and Y, which is a dissimilarity measure between Y and a certain subset of the set of contiguous substrings of X. Another estimate of T(U), namely SM(Y), is also suggested. The accuracy of SM(Y) is only slightly less than that of S*(Y), but the computation time of SM(Y) is substantially less than that of S*(Y). Experimental results involving 1900 noisy substrings and dictionaries which are subsets of the 1023 most common English words [11] indicate that the accuracy of the estimate S*(Y) is around 99 percent and that of SM(Y) is about 98 percent.
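
To make the idea of a set estimate concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes unit-cost Levenshtein edits and uses a standard dynamic-programming recurrence for approximate substring matching in place of the kth distance Dk(X, Y), whose definition the abstract does not give. The names min_substring_distance and estimate_set are hypothetical, and the selection rule (keep the dictionary words whose best substring distance to Y is smallest) is one plausible reading of "resembles Y most".

```python
from typing import Iterable, List


def min_substring_distance(x: str, y: str) -> int:
    """Smallest unit-cost Levenshtein distance between y and any
    contiguous substring of x (the empty substring is allowed)."""
    # prev[j]: best distance between the first i characters of y and
    # some substring of x that ends at position j. For i = 0 this is 0
    # everywhere, because a match may start at any position in x.
    prev = [0] * (len(x) + 1)
    for i in range(1, len(y) + 1):
        # Against an empty substring, i characters of y are unmatched.
        curr = [i] + [0] * len(x)
        for j in range(1, len(x) + 1):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # character of y with no partner in x
                curr[j - 1] + 1,     # extra character of x inside the match
            )
        prev = curr
    # The matched substring may end anywhere in x.
    return min(prev)


def estimate_set(dictionary: Iterable[str], y: str) -> List[str]:
    """Hypothetical set estimate: the dictionary words whose closest
    substring is at the minimum distance from y (assumed selection rule)."""
    words = list(dictionary)
    if not words:
        return []
    distances = [min_substring_distance(w, y) for w in words]
    best = min(distances)
    return [w for w, d in zip(words, distances) if d == best]


if __name__ == "__main__":
    # Toy dictionary; "forn" stands in for a noisy version of "form".
    print(estimate_set(["information", "perform", "table", "reform"], "forn"))
```

Each call to min_substring_distance takes time proportional to the product of the two string lengths, so this sketch does not reproduce the complexity or the accuracy figures reported for S*(Y) and SM(Y).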