Implicit indexing of natural language text by reorganizing bytecodes

Authors:
Nieves R. Brisaboa; Antonio Fariña; Susana Ladra; Gonzalo Navarro
Affiliations:
Database Laboratory, University of A Coruña, A Coruña, Spain 15071; Database Laboratory, University of A Coruña, A Coruña, Spain 15071; Database Laboratory, University of A Coruña, A Coruña, Spain 15071; Department of Computer Science, University of Chile, Santiago, Chile 2120
Venue:
Information Retrieval
Year:
2012

Abstract

Word-based byte-oriented compression has succeeded on large natural language text databases by providing competitive compression ratios, fast random access, and direct sequential searching. We show that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, we obtain a new implicitly indexed representation of the compressed text in which search times are drastically improved. The occurrences of a word can be listed directly, without any text scanning, and in general any inverted-index-like capability, such as efficient phrase searches, can be emulated without storing any inverted list information. We show experimentally that our proposal is much more efficient not only than sequential searches over compressed text, but also than explicit inverted indexes and other types of indexes when these use little extra space. Our representation is especially successful when searching for single words and short phrases.
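
The reorganization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a toy prefix-free word code with only 2 "stopper" and 2 "continuer" symbol values (so that multi-symbol codewords appear in a tiny example), and it uses a linear-scan `select` as a stand-in for the small auxiliary counters a practical structure would keep. The names `assign_codes`, `build_tree`, `select`, and `locate` are hypothetical.

```python
from collections import Counter, defaultdict

def assign_codes(words, s=2, c=2):
    """Toy dense code (an assumption, not the paper's code): values 0..s-1
    end a codeword, values s..s+c-1 announce that another symbol follows.
    More frequent words receive shorter codewords."""
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    code = {}
    for rank, word in enumerate(ranked):
        length, block = 1, s              # there are s * c**(k-1) codewords of length k
        while rank >= block:
            rank -= block
            block *= c
            length += 1
        last, rest = rank % s, rank // s
        prefix = []
        for _ in range(length - 1):
            prefix.append(s + rest % c)
            rest //= c
        code[word] = tuple(reversed(prefix)) + (last,)
    return code

def build_tree(words, code):
    """Rearrange codeword symbols: the root keeps the first symbol of every
    word in text order; the node for prefix p keeps the next symbol of every
    codeword that starts with p, also in text order."""
    nodes = defaultdict(list)
    for word in words:
        cw = code[word]
        for depth, symbol in enumerate(cw):
            nodes[cw[:depth]].append(symbol)
    return nodes

def select(seq, symbol, j):
    """Position of the j-th (1-based) occurrence of symbol in seq.
    Linear scan here, purely for illustration."""
    seen = 0
    for pos, value in enumerate(seq):
        if value == symbol:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences")

def locate(nodes, cw):
    """List the word positions of a codeword by walking from its deepest
    node up to the root; the original text order is never scanned."""
    leaf = nodes.get(cw[:-1], [])
    positions = []
    for j in range(1, leaf.count(cw[-1]) + 1):
        pos = select(leaf, cw[-1], j)
        for depth in range(len(cw) - 1, 0, -1):
            # Entry pos of this node is the (pos+1)-th occurrence of the
            # symbol cw[depth-1] in the parent node.
            pos = select(nodes[cw[:depth - 1]], cw[depth - 1], pos + 1)
        positions.append(pos)
    return positions

text = "to be or not to be that is the question to be".split()
code = assign_codes(text)
nodes = build_tree(text, code)
print(locate(nodes, code["be"]))
print(locate(nodes, code["the"]))
```

In this sketch the root node alone has one entry per word of the text, so a position found in the root is a word position in the original text. Counting the occurrences of a word only requires counting its last symbol in its deepest node.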