Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

Authors:
Xiaochun Yang;Bin Wang;Chen Li
Affiliations:
Northeastern University, Shenyang, China;Northeastern University, Shenyang, China;University of California, Irvine, Irvine, CA, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Citing 20
Cited 26

Estimating alphanumeric selectivity in the presence of wildcards

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Substring selectivity estimation

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Indexing text data under space constraints

Proceedings of the thirteenth ACM international conference on Information and knowledge management
n-gram/2L: a space and time efficient two-level n-gram inverted index structure

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Selectivity estimation for fuzzy string predicates in large data sets

VLDB '05 Proceedings of the 31st international conference on Very large data bases
q-Gram Matching Using Tree Models

IEEE Transactions on Knowledge and Data Engineering
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Estimating the selectivity of approximate string queries

ACM Transactions on Database Systems (TODS)
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Extending q-grams to estimate selectivity of string matching with low edit distance

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
VGRAM: improving performance of approximate queries on string collections using variable-length grams

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient Merging and Filtering Algorithms for Approximate String Searches

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Improving "email speech acts" analysis via n-gram selection

ACTS '09 Proceedings of the HLT-NAACL 2006 Workshop on Analyzing Conversations in Text and Speech

Efficient top-k count queries over imprecise duplicates

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Incremental maintenance of length normalized indexes for approximate string matching

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Efficient approximate entity extraction with edit distance constraints

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Space-economical partial gram indices for exact substring matching

Proceedings of the 18th ACM conference on Information and knowledge management
Efficient approximate search on string collections

Proceedings of the VLDB Endowment
Reference-based alignment in large sequence databases

Proceedings of the VLDB Endowment
SimDB: a similarity-aware database system

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Approximate entity extraction in temporal databases

World Wide Web
Approximate String Processing

Foundations and Trends in Databases
WHAM: a high-throughput sequence alignment method

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient exact edit similarity query processing with the asymmetric signature scheme

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient similarity joins for near-duplicate detection

ACM Transactions on Database Systems (TODS)
A fast and accurate method for approximate string search

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Efficient fuzzy full-text type-ahead search

The VLDB Journal — The International Journal on Very Large Data Bases
A generic framework for efficient and effective subsequence retrieval

Proceedings of the VLDB Endowment
Efficient range queries over uncertain strings

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
WHAM: A High-Throughput Sequence Alignment Method

ACM Transactions on Database Systems (TODS)
Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

Proceedings of the Joint EDBT/ICDT 2013 Workshops
FPI: a novel indexing method using frequent patterns for approximate string searches

Proceedings of the Joint EDBT/ICDT 2013 Workshops
Cache-aware parallel approximate matching and join algorithms using BWT

Proceedings of the Joint EDBT/ICDT 2013 Workshops
LinkIT: privacy preserving record linkage and integration via transformations

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems (TODS)
Similarity queries: their conceptual evaluation, transformations, and processing

The VLDB Journal — The International Journal on Very Large Data Bases
Asymmetric signature schemes for efficient exact edit similarity query processing

ACM Transactions on Database Systems (TODS)
Automated discovery of multi-faceted ontologies for accurate query answering and future semantic reasoning

Data & Knowledge Engineering
Leveraging spatial join for robust tuple extraction from web pages

Information Sciences: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Approximate queries on a collection of strings are important in many applications such as record linkage, spell checking, and Web search, where inconsistencies and errors exist in data as well as queries. Several existing algorithms use the concept of "grams," which are substrings of strings used as signatures for the strings to build index structures. A recently proposed technique, called VGRAM, improves the performance of these algorithms by using a carefully chosen dictionary of variable-length grams based on their requencies in the string collection. Since an index structure using fixed-length grams can be viewed as a special case of VGRAM, a fundamental problem arises naturally: what is the relationship between the gram dictionary and the performance of queries? We study this problem in this paper. We propose a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance. We analyze how a gram dictionary affects the index structure of the string collection and ultimately the performance of queries. We also propose an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries. Our experiments on real data sets show the improvement on query performance achieved by these techniques. To our best knowledge, this study is the first cost-based quantitative approach to deciding good grams for approximate string queries.