VGRAM: improving performance of approximate queries on string collections using variable-length grams

Authors:
Chen Li;Bin Wang;Xiaochun Yang
Affiliations:
University of California, Irvine, CA;Northeastern University, Liaoning, China;Northeastern University, Liaoning, China
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Citing 22
Cited 53

Approximate string-matching with q-grams and maximal matches

Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School
Estimating alphanumeric selectivity in the presence of wildcards

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Integration of heterogeneous databases without common domains using queries based on textual similarity

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Substring selectivity estimation

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Filtration with q-Samples in Approximate String Matching

CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching
On Using q-Gram Locations in Approximate String Matching

ESA '95 Proceedings of the Third Annual European Symposium on Algorithms
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Comparative study of name disambiguation problem using a scalable blocking-based framework

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
n-gram/2L: a space and time efficient two-level n-gram inverted index structure

VLDB '05 Proceedings of the 31st international conference on Very large data bases
q-Gram Matching Using Tree Models

IEEE Transactions on Knowledge and Data Engineering
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
A variable-length category-based n-gram language model

ICASSP '96 Proceedings of the Acoustics, Speech, and Signal Processing, 1996. on Conference Proceedings., 1996 IEEE International Conference - Volume 01
N-Gram feature selection for authorship identification

AIMSA'06 Proceedings of the 12th international conference on Artificial Intelligence: methodology, Systems, and Applications
Application of variable length N-gram vectors to monolingual and bilingual information retrieval

CLEF'04 Proceedings of the 5th conference on Cross-Language Evaluation Forum: multilingual Information Access for Text, Speech and Images

Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Efficient Similarity Search for Tree-Structured Data

SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Approximate substring selectivity estimation

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Performance evaluation of similarity join for real time information integration

Proceedings of the 2nd Bangalore Annual Compute Conference
Efficient interactive fuzzy keyword search

Proceedings of the 18th international conference on World wide web
Efficient top-k algorithms for fuzzy search in string collections

Proceedings of the First International Workshop on Keyword Search on Structured Data
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Creating probabilistic databases from duplicated data

The VLDB Journal — The International Journal on Very Large Data Bases
Space-economical partial gram indices for exact substring matching

Proceedings of the 18th ACM conference on Information and knowledge management
Efficient algorithms for approximate member extraction using signature-based inverted lists

Proceedings of the 18th ACM conference on Information and knowledge management
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Reference-based alignment in large sequence databases

Proceedings of the VLDB Endowment
Framework for evaluating clustering algorithms in duplicate detection

Proceedings of the VLDB Endowment
Sampling dirty data for matching attributes

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Probabilistic string similarity joins

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
On active learning of record matching packages

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Similarity joins as stronger metric operations

SIGSPATIAL Special
Simple and efficient algorithm for approximate dictionary matching

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Exact and efficient proximity graph computation

ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
SigMatch: fast and scalable multi-pattern matching

Proceedings of the VLDB Endowment
Trie-join: efficient trie-based string similarity joins with edit-distance constraints

Proceedings of the VLDB Endowment
Approximate entity extraction in temporal databases

World Wide Web
Approximate String Processing

Foundations and Trends in Databases
WHAM: a high-throughput sequence alignment method

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient exact edit similarity query processing with the asymmetric signature scheme

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Schema mapping with quality assurance for data integration

APWeb'11 Proceedings of the 13th Asia-Pacific web conference on Web technologies and applications
PG-Skip: proximity graph based clustering of long strings

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II
An effective approach for searching closest sentence translations from the web

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
Embedding-based subsequence matching in time-series databases

ACM Transactions on Database Systems (TODS)
A fast and accurate method for approximate string search

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
PG-join: proximity graph based string similarity joins

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
Efficient fuzzy full-text type-ahead search

The VLDB Journal — The International Journal on Very Large Data Bases
Efficient top-K approximate searches against a relation with multiple attributes

World Wide Web
Facilitating pattern discovery for relation extraction with semantic-signature-based clustering

Proceedings of the 20th ACM international conference on Information and knowledge management
Can we beat the prefix filtering?: an adaptive framework for similarity join and search

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
A generic framework for efficient and effective subsequence retrieval

Proceedings of the VLDB Endowment
Efficient range queries over uncertain strings

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
WHAM: A High-Throughput Sequence Alignment Method

ACM Transactions on Database Systems (TODS)
Landmark-join: hash-join based string similarity joins with edit distance constraints

DaWaK'12 Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery
Size matters: finding the most informative set of window lengths

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part II
Detecting near-duplicate documents using sentence-level features and supervised learning

Expert Systems with Applications: An International Journal
Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Approximate string matching by position restricted alignment

Proceedings of the Joint EDBT/ICDT 2013 Workshops
FPI: a novel indexing method using frequent patterns for approximate string searches

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Cache-aware parallel approximate matching and join algorithms using BWT

Proceedings of the Joint EDBT/ICDT 2013 Workshops
PartSS: an efficient partition-based filtering for edit distance constraints

ADC '11 Proceedings of the Twenty-Second Australasian Database Conference - Volume 115
Efficient top-k algorithms for approximate substring matching

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Asymmetric signature schemes for efficient exact edit similarity query processing

ACM Transactions on Database Systems (TODS)
Extending string similarity join to tolerant fuzzy token matching

ACM Transactions on Database Systems (TODS)
Clustering with Proximity Graphs: Exact and Efficient Algorithms

International Journal of Knowledge-Based Organizations

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many applications need to solve the following problem of approximate string matching: from a collection of strings, how to find those similar to a given string, or the strings in another (possibly the same) collection of strings? Many algorithms are developed using fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a novel technique, called VGRAM, to improve the performance of these algorithms. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present our extensive experiments on real data sets to evaluate the technique, and show the significant performance improvements on three existing algorithms.