On effective multi-dimensional indexing for strings

Authors:
H. V. Jagadish; Nick Koudas; Divesh Srivastava
Affiliations:
University of Michigan; AT&T Labs-Research; AT&T Labs-Research
Venue:
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Year:
2000


Abstract

As databases have expanded in scope from storing purely business data to include XML documents, product catalogs, e-mail messages, and directory data, it has become increasingly important to search databases based on wild-card string matching: prefix matching, for example, is more common (and useful) than exact matching, for such data. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Traditional multi-dimensional index structures, designed with (fixed length) numeric data in mind, are not suitable for matching unbounded length string data.

In this paper, we describe a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data. The key ideas are (a) a carefully developed mapping function from strings to rational numbers, (b) representing an unbounded length string in an index leaf page by a fixed length offset to an external key, and (c) storing multiple elided tries, one per dimension, in an index page to prune search during traversal of index pages. These basic ideas affect all index algorithms. In this paper, we present efficient algorithms for different types of string matching.

While our technique is applicable to a wide range of multi-dimensional index structures, we instantiate our generic techniques by adapting the 2-dimensional R-tree to string data. We demonstrate the space effectiveness and time benefits of using the string R-tree both analytically and experimentally.
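
Key idea (a) can be illustrated with a small sketch. The abstract does not give the paper's actual mapping function, so the one below is an assumption chosen only for illustration: each string is read as a fraction in base 27 over a lowercase alphabet, with digit 0 left unused so that a string always maps below its own extensions. The names `to_rational`, `prefix_interval`, and `prefix_box_query` are hypothetical. Under this mapping, lexicographic order is preserved, a prefix match on one attribute becomes a half-open numeric interval, and a prefix match on two attributes becomes a rectangle, which is the kind of query a 2-dimensional index such as an R-tree answers.

```python
from fractions import Fraction

# Assumed alphabet for this sketch; the paper's mapping is not given in the abstract.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET) + 1  # digit 0 is never used, so "ab" maps below "aba"
DIGIT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def to_rational(s):
    """Map a string to an exact rational in [0, 1), preserving lexicographic order."""
    value = Fraction(0)
    scale = Fraction(1)
    for ch in s:
        scale /= BASE
        value += DIGIT[ch] * scale
    return value


def prefix_interval(prefix):
    """Return (low, high) such that a string starts with prefix iff low <= value < high."""
    low = to_rational(prefix)
    return low, low + Fraction(1, BASE ** len(prefix))


def prefix_box_query(records, prefixes):
    """Linear scan standing in for an index: keep tuples inside the prefix rectangle."""
    box = [prefix_interval(p) for p in prefixes]
    return [
        rec for rec in records
        if all(lo <= to_rational(v) < hi for v, (lo, hi) in zip(rec, box))
    ]


# Hypothetical two-attribute data: (surname, organisation).
records = [
    ("jagadish", "michigan"),
    ("koudas", "att"),
    ("srivastava", "att"),
    ("jagger", "mit"),
]
hits = prefix_box_query(records, ("jag", "mi"))
```

Exact rationals grow with string length, so this sketch does not address storage. In the abstract, that concern is handled separately by ideas (b) and (c), the fixed length offsets to external keys and the per-dimension elided tries.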