n-gram/2L: a space and time efficient two-level n-gram inverted index structure

Authors:
Min-Soo Kim; Kyu-Young Whang; Jae-Gil Lee; Min-Jae Lee
Affiliations:
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea (all four authors)
Venue:
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Year:
2005

Abstract

The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Because of these advantages, it has been widely used in information retrieval and in similar-sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and its query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the index size and improves query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of position information that exists in the n-gram inverted index. It is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database grows; 2) the query processing time increases only very slightly as the query gets longer. Experimental results on databases of 1 GByte show that the n-gram/2L index is smaller than the n-gram inverted index by a factor of up to 1.9 to 2.7 and, at the same time, improves query performance by up to 13.1 times.
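
The two-step construction described in the abstract can be illustrated with a minimal Python sketch. This is a toy illustration, not the authors' implementation. The names `build_ngram_2l`, `ngram_positions`, `front`, and `back` are hypothetical. The sketch also assumes that consecutive length-m subsequences overlap by n - 1 characters so that no n-gram is lost at a subsequence boundary, a detail the abstract does not state.

```python
from collections import defaultdict


def build_ngram_2l(docs, m, n):
    """Build a toy two-level index over a list of strings.

    Assumption (not stated in the abstract): consecutive subsequences
    overlap by n - 1 characters, so each n-gram of a document lies
    inside exactly one subsequence.
    """
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= m")
    step = m - n + 1           # distance between subsequence start offsets
    subseq_ids = {}            # subsequence text -> subsequence id
    front = defaultdict(list)  # subsequence id -> [(doc id, offset in doc)]
    back = defaultdict(list)   # n-gram -> [(subsequence id, offset in subsequence)]

    for doc_id, text in enumerate(docs):
        # Step 1: extract subsequences of length m (the last may be shorter).
        for start in range(0, len(text) - n + 1, step):
            sub = text[start:start + m]
            sid = subseq_ids.get(sub)
            if sid is None:
                sid = len(subseq_ids)
                subseq_ids[sub] = sid
                # Step 2: extract n-grams from the subsequence. This happens
                # only once per distinct subsequence, which is where the
                # repeated position information is avoided.
                for off in range(len(sub) - n + 1):
                    back[sub[off:off + n]].append((sid, off))
            front[sid].append((doc_id, start))
    return front, back


def ngram_positions(front, back, gram):
    """Recover (doc id, offset) pairs for an n-gram by joining both levels."""
    return sorted(
        (doc_id, start + off)
        for sid, off in back.get(gram, [])
        for doc_id, start in front[sid]
    )


docs = ["ABCABCABCA", "XABCA"]
front, back = build_ngram_2l(docs, m=4, n=2)
print(ngram_positions(front, back, "AB"))
```

In this example the subsequence "ABCA" occurs three times in the first document, yet its n-grams are indexed only once in `back`; the three occurrences are recorded only in `front`. A lookup joins the two levels to recover document offsets.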