Self-indexing Natural Language

Authors:
Nieves R. Brisaboa;Antonio Fariña;Gonzalo Navarro;Angeles S. Places;Eduardo Rodríguez
Affiliations:
Database Lab., Univ. da Coruña, Spain;Database Lab., Univ. da Coruña, Spain;Dept. of Computer Science, Univ. of Chile;Database Lab., Univ. da Coruña, Spain;Database Lab., Univ. da Coruña, Spain
Venue:
SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Year:
2008

Abstract

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and suffix arrays. A self-index represents a string in space close to its compressed size and supports indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. Several challenges are involved, such as dealing with a very large alphabet and separating searchable content from non-searchable presentation aspects of the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it achieves better search times than inverted indexes (using the same overall space) when searching for phrases.
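
To make the idea of treating text as a string of words more concrete, the following is a minimal sketch, not the paper's method. It builds a plain, uncompressed suffix array over word identifiers and answers phrase queries by binary search. A real self-index would replace this structure with a compressed one and would also store the separators needed to reproduce the original text. The function names (`tokenize`, `build_word_suffix_array`, `find_phrase`) and the tokenization rule are assumptions made for illustration only.

```python
import re


def tokenize(text):
    # Assumed rule: searchable content is lowercase alphanumeric words;
    # separators and punctuation (presentation aspects) are dropped here.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_suffix_array(words):
    # Each distinct word becomes one symbol of a (potentially very large) alphabet.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Naive construction: sort word positions by the id sequence starting there.
    # Fine for a small demo, far too slow for real collections.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return vocab, ids, sa


def find_phrase(phrase, vocab, ids, sa):
    # Returns the word positions where the phrase starts, in text order.
    pat_words = tokenize(phrase)
    if not pat_words or any(w not in vocab for w in pat_words):
        return []
    pat = [vocab[w] for w in pat_words]
    m = len(pat)

    # First suffix whose m-word prefix is >= the pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-word prefix is > the pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ids[sa[mid]:sa[mid] + m] <= pat:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "The cat sat on the mat, and the cat ran."
vocab, ids, sa = build_word_suffix_array(tokenize(text))
print(find_phrase("the cat", vocab, ids, sa))  # [0, 7]
```

Because all occurrences of a phrase form one contiguous range of the suffix array, a phrase query costs two binary searches regardless of how frequent its individual words are. An inverted index, by contrast, generally has to intersect the position lists of the words in the phrase.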