AS-index: a structure for string search using n-grams and algebraic signatures

Authors:
Cédric du Mouza;Witold Litwin;Philippe Rigaux;Thomas Schwarz
Affiliations:
CNAM, Paris, France;CERIA, Paris, France;INRIA-SACLAY, Orsay, France;Santa Clara Univ., Santa Clara, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

AS-Index is a new index structure for exact string search in disk-resident databases. Unlike known alternatives, which are based on trees or tries, it uses hashing. It typically indexes every n-gram in the database, though non-dense indexing is also possible. The hash function uses the algebraic signatures of all n-grams. Hashing provides constant index access time for arbitrarily long patterns, whereas the search cost of other structures is at best logarithmic. The storage overhead of AS-Index is roughly 500-600%, similar to or smaller than that of the alternatives. We present the index structure, our use of algebraic signatures, and the search algorithm. We discuss the design trade-offs and present a theoretical and experimental performance analysis. We compare AS-Index to the main alternatives. We conclude that AS-Index is an attractive structure and indicate directions for future work.
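
To make the idea of hashing n-grams by algebraic signature more concrete, here is a minimal in-memory sketch. It is not the structure or the search algorithm of the paper. It assumes one-byte symbols, a signature of the form sum of s_i * alpha^i over GF(2^8), the field polynomial 0x11D with alpha = 2, and a simple "look up the first n-gram, then verify" search. All names (`gf_mul`, `signature`, `build_index`, `search`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Assumed field parameters (not taken from the paper):
GF_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a common primitive polynomial
ALPHA = 2        # primitive element of GF(2^8) under GF_POLY


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8): carry-less product reduced by GF_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:            # degree reached 8: reduce
            a ^= GF_POLY
        b >>= 1
    return result


def signature(symbols: bytes) -> int:
    """One-symbol signature: XOR-sum of symbols[i] * ALPHA**i in GF(2^8)."""
    sig = 0
    power = 1                    # ALPHA**0
    for s in symbols:
        sig ^= gf_mul(s, power)
        power = gf_mul(power, ALPHA)
    return sig


def build_index(text: bytes, n: int) -> dict:
    """Dense toy index: map the signature of every n-gram to its start offsets."""
    index = defaultdict(list)
    for pos in range(len(text) - n + 1):
        index[signature(text[pos:pos + n])].append(pos)
    return index


def search(text: bytes, index: dict, n: int, pattern: bytes) -> list:
    """Hash the pattern's first n-gram, then verify each candidate offset."""
    if len(pattern) < n:
        raise ValueError("pattern must contain at least n symbols")
    candidates = index.get(signature(pattern[:n]), [])
    # Signatures can collide, so every candidate is checked against the text.
    return [p for p in candidates if text[p:p + len(pattern)] == pattern]


text = b"abracadabra"
index = build_index(text, 3)
print(search(text, index, 3, b"abra"))  # [0, 7]
```

With a single one-byte signature there are only 256 possible hash values, so this toy version leans heavily on the verification step. A practical index would need a larger signature space and on-disk buckets, which this sketch does not attempt to model.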