As the amount of text data grows explosively, an efficient index structure for large text databases becomes ever more important. The n-gram inverted index (simply, the n-gram index) has been widely used in information retrieval and in approximate string matching because of its two major advantages: it is language-neutral and error-tolerant. Nevertheless, the n-gram index also has drawbacks: its size tends to be very large, and its query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the size and improves the query performance by using relational normalization theory. We first identify that, in the (full-text) n-gram index, there exists redundancy in the position information caused by a non-trivial multivalued dependency. The proposed index eliminates this redundancy by constructing the index in two levels: the front-end index and the back-end index. We formally prove that this two-level construction is identical to the relational normalization process. We call this process structural optimization of the n-gram index. The n-gram/2L index has excellent properties: (1) it significantly reduces the size and improves the performance compared with the n-gram index, with these improvements becoming more marked as the database size gets larger; (2) the query processing time increases only very slightly as the query length gets longer. Experimental results using real databases of 1 GB show that the size of the n-gram/2L index is reduced by a factor of up to 1.9 to 2.4 and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram index. We also compare the n-gram/2L index with Mäkinen's compact suffix array (CSA) (Proc. 11th Annual Symposium on Combinatorial Pattern Matching, pp. 305-319, 2000) stored on disk. Experimental results show that the n-gram/2L index outperforms the CSA when the query length is short (i.e., less than 15 to 20), and the CSA is similar to or better than the n-gram/2L index when the query length is long (i.e., more than 15 to 20).
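The redundancy argument can be illustrated with a minimal Python sketch. The abstract does not say what each level stores, so the layout here is an assumption made for illustration: documents are cut into subsequences of length m that overlap by n-1 characters, the front-end index maps each distinct subsequence to its positions in the documents, and the back-end index maps each n-gram to its offsets inside the subsequences. The function names and the parameters `n` and `m` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def ngrams(text, n):
    """Yield (offset, n-gram) pairs using a window that slides by one character."""
    for i in range(len(text) - n + 1):
        yield i, text[i:i + n]


def build_ngram_index(docs, n):
    """Plain n-gram inverted index: n-gram -> list of (doc_id, offset)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset, gram in ngrams(text, n):
            index[gram].append((doc_id, offset))
    return index


def build_two_level_index(docs, n, m):
    """Assumed two-level layout (illustrative only).

    front: subsequence -> list of (doc_id, start offset in the document)
    back:  n-gram      -> list of (subsequence, offset in the subsequence)
    """
    # Adjacent subsequences overlap by n-1 characters, so every n-gram
    # of a document falls inside exactly one subsequence.
    step = m - n + 1
    front = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for start in range(0, len(text) - n + 1, step):
            front[text[start:start + m]].append((doc_id, start))
    back = defaultdict(list)
    for sub in front:  # each distinct subsequence is indexed only once
        for offset, gram in ngrams(sub, n):
            back[gram].append((sub, offset))
    return front, back


def lookup_two_level(front, back, gram):
    """Recover (doc_id, offset) postings by joining the two levels."""
    return sorted(
        (doc_id, start + offset)
        for sub, offset in back.get(gram, [])
        for doc_id, start in front[sub]
    )


def posting_count(index):
    """Total number of postings stored in one inverted index."""
    return sum(len(postings) for postings in index.values())


# Usage: a repetitive document makes the shared subsequences visible.
docs = ["abcd" * 8, "xyabcdabcd"]
plain = build_ngram_index(docs, n=2)
front, back = build_two_level_index(docs, n=2, m=5)
plain_size = posting_count(plain)
two_level_size = posting_count(front) + posting_count(back)
```

In this sketch the join in `lookup_two_level` returns the same postings as the plain index, while a subsequence that occurs many times contributes its n-gram offsets to the back-end index only once. That is the kind of position redundancy the abstract attributes to a multivalued dependency; the real index and its parameter choices may differ from this toy version.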