The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Because of these advantages, it has been widely used in information retrieval and in similar-sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the index size and improves query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of position information that exists in the n-gram inverted index. It is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database grows; 2) the query processing time increases only very slightly as the query length increases. Experimental results using databases of 1 GByte show that the size of the n-gram/2L index is reduced by up to 1.9 to 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with the n-gram inverted index.
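
The two-step construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`extract_subsequences`, `build_two_level_index`, `ngram_positions`) and the in-memory dictionaries are hypothetical, and the choice to overlap adjacent subsequences by n - 1 characters is an assumption of this sketch, made so that every n-gram of a document falls inside exactly one subsequence. The abstract itself does not state how subsequences are laid out.

```python
from collections import defaultdict


def extract_subsequences(text, m, n):
    """Step 1: cut text into length-m subsequences.

    Assumption of this sketch: adjacent subsequences overlap by n - 1
    characters, so no n-gram is lost at a subsequence boundary.
    Returns a list of (offset_in_document, subsequence) pairs.
    """
    if m < n:
        raise ValueError("m must be at least n")
    if not text:
        return []
    step = m - (n - 1)
    result = []
    offset = 0
    while True:
        result.append((offset, text[offset:offset + m]))
        if offset + m >= len(text):
            break  # the last (possibly shorter) subsequence reached the end
        offset += step
    return result


def build_two_level_index(documents, m, n):
    """Build two hypothetical levels as plain dictionaries.

    back_end:  subsequence -> [(doc_id, offset of subsequence in document)]
    front_end: n-gram      -> [(subsequence, offset of n-gram in subsequence)]
    """
    back_end = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for offset, subseq in extract_subsequences(text, m, n):
            back_end[subseq].append((doc_id, offset))

    # Step 2: extract n-grams once per distinct subsequence, not once per
    # occurrence. This is where repeated position information is avoided.
    front_end = defaultdict(list)
    for subseq in back_end:
        for i in range(len(subseq) - n + 1):
            front_end[subseq[i:i + n]].append((subseq, i))
    return dict(front_end), dict(back_end)


def ngram_positions(front_end, back_end, gram):
    """Recover (doc_id, offset) positions of an n-gram by joining both levels."""
    positions = set()
    for subseq, i in front_end.get(gram, []):
        for doc_id, offset in back_end[subseq]:
            positions.add((doc_id, offset + i))
    return sorted(positions)


# Usage: two short documents that share the subsequence "ABCD".
docs = ["ABCDEF", "ABCDXY"]
front_end, back_end = build_two_level_index(docs, m=4, n=2)
print(back_end["ABCD"])                          # [(0, 0), (1, 0)]
print(front_end["BC"])                           # [('ABCD', 1)]
print(ngram_positions(front_end, back_end, "BC"))  # [(0, 1), (1, 1)]
```

In this toy example the n-gram "BC" is stored once, against the shared subsequence "ABCD", while the two document positions are stored once in the other level. A conventional n-gram inverted index would store a separate document position for every n-gram occurrence.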