Structural optimization of a full-text n-gram index using relational normalization

Authors:
Min-Soo Kim;Kyu-Young Whang;Jae-Gil Lee;Min-Jae Lee
Affiliations:
Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea (all authors)
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2008

Abstract

As the amount of text data grows explosively, an efficient index structure for large text databases becomes ever more important. The n-gram inverted index (simply, the n-gram index) has been widely used in information retrieval and in approximate string matching because of its two major advantages: it is language-neutral and error-tolerant. Nevertheless, the n-gram index also has drawbacks: its size tends to be very large, and its query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the size and improves the query performance by using relational normalization theory. We first identify that, in the (full-text) n-gram index, there exists redundancy in the position information caused by a non-trivial multivalued dependency. The proposed index eliminates this redundancy by constructing the index in two levels: the front-end index and the back-end index. We formally prove that this two-level construction is identical to the relational normalization process. We call this process structural optimization of the n-gram index. The n-gram/2L index has excellent properties: (1) it significantly reduces the size and improves the performance compared with the n-gram index, with these improvements becoming more marked as the database size gets larger; (2) the query processing time increases only very slightly as the query length gets longer. Experimental results using real databases of 1 GB show that the size of the n-gram/2L index is reduced by up to 1.9–2.4 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram index. We also compare the n-gram/2L index with Mäkinen's compact suffix array (CSA) (Proc. 11th Annual Symposium on Combinatorial Pattern Matching, pp. 305–319, 2000) stored on disk. Experimental results show that the n-gram/2L index outperforms the CSA when the query length is short (i.e., less than 15–20), and the CSA is similar to or better than the n-gram/2L index when the query length is long (i.e., more than 15–20).
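
The abstract does not spell out how the two levels are laid out, so the following Python sketch is illustrative only and is not the authors' implementation. It assumes that each document is cut into fixed-length pieces of m characters that overlap by n - 1 characters, that one level maps n-grams to positions inside pieces, and that the other level maps pieces to positions inside documents. The names `build_flat_index`, `build_two_level_index`, `expand`, `ngram_level`, and `subseq_level` are hypothetical. Under these assumptions, the n-grams inside a piece and the documents containing that piece vary independently, which is presumably the kind of multivalued dependency the abstract refers to; storing the two facts separately is then a lossless decomposition, and `expand` is the corresponding join.

```python
from collections import defaultdict


def ngrams(text, n):
    """Yield (offset, n-gram) pairs for every position in text."""
    for i in range(len(text) - n + 1):
        yield i, text[i:i + n]


def build_flat_index(docs, n):
    """One-level index: n-gram -> list of (doc_id, offset) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for off, gram in ngrams(text, n):
            index[gram].append((doc_id, off))
    return index


def build_two_level_index(docs, n, m):
    """Two-level index under the assumed layout described in the lead-in.

    Pieces have length m and overlap by n - 1 characters, so every
    n-gram of a document lies wholly inside exactly one piece.
    """
    assert m >= n, "pieces must be at least as long as one n-gram"
    step = m - (n - 1)
    piece_ids = {}                     # piece text -> piece id
    ngram_level = defaultdict(list)    # n-gram -> (piece id, offset in piece)
    subseq_level = defaultdict(list)   # piece id -> (doc_id, offset in doc)
    for doc_id, text in enumerate(docs):
        for start in range(0, max(len(text) - n + 1, 1), step):
            piece = text[start:start + m]
            if piece not in piece_ids:
                # First time this piece is seen: index its n-grams once.
                piece_ids[piece] = len(piece_ids)
                for off, gram in ngrams(piece, n):
                    ngram_level[gram].append((piece_ids[piece], off))
            subseq_level[piece_ids[piece]].append((doc_id, start))
    return ngram_level, subseq_level


def expand(ngram_level, subseq_level):
    """Join the two levels back into flat (doc_id, offset) postings."""
    flat = defaultdict(list)
    for gram, postings in ngram_level.items():
        for piece_id, off in postings:
            for doc_id, start in subseq_level[piece_id]:
                flat[gram].append((doc_id, start + off))
    return flat


def count_postings(index):
    """Total number of postings stored in one index level."""
    return sum(len(postings) for postings in index.values())


if __name__ == "__main__":
    docs = ["abcdabcdabcd", "abcdabcdxx"]
    flat = build_flat_index(docs, n=2)
    ngram_level, subseq_level = build_two_level_index(docs, n=2, m=5)
    print("flat postings:", count_postings(flat))
    print("two-level postings:",
          count_postings(ngram_level) + count_postings(subseq_level))
```

Because a repeated piece is indexed once and then only referenced, the two-level form stores fewer postings when the text is repetitive; on text with few repeated pieces this toy version can be larger than the flat index, so the sketch says nothing about the sizes reported in the abstract.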