SASE: implementation of a compressed text search engine

Authors:
Srinidhi Varadarajan;Tzi-cker Chiueh
Affiliations:
Department of Computer Science, State University of New York, Stony Brook, NY;Department of Computer Science, State University of New York, Stony Brook, NY
Venue:
USITS'97: Proceedings of the USENIX Symposium on Internet Technologies and Systems
Year:
1997

Abstract

Keyword-based search engines are the basic building blocks of text retrieval systems. Higher-level systems such as content-sensitive search engines and knowledge-based systems still rely on keyword search as the underlying text retrieval mechanism. With the explosive growth in content, Internet and intranet information repositories require efficient mechanisms to store as well as index data. In this paper we discuss the implementation of the Shrink and Search Engine (SASE) framework, which unites text compression and indexing to maximize keyword search performance while reducing storage cost. SASE features the novel capability of searching compressed text directly, without explicit decompression. The implementation includes a search server architecture, which can be accessed from a Java front-end to perform keyword search on the Internet. The performance results show that the compression efficiency of SASE is within 7-17% of GZIP, one of the best lossless compression schemes. The sum of the compressed file size and the inverted indices is only 55-76% of the size of the original database, while the search performance is comparable to that of a fully inverted index. The framework allows a flexible trade-off between search performance and storage requirements for the search indices.
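
The general idea of searching compressed text with a compact index can be illustrated with a minimal sketch. This is not the SASE implementation, whose internals the abstract does not describe. It assumes a simple word-level dictionary code (each distinct word replaced by an integer) and a block-level inverted index. The names `tokenize`, `compress`, `build_index`, `search`, and the parameter `block_size` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(text, block_size):
    """Replace each word with an integer code; group codes into blocks.

    Assumed scheme for illustration only: a real system would also
    entropy-code the integers, which this sketch omits.
    """
    codes = {}      # word -> integer code (the dictionary)
    blocks = [[]]   # compressed text as blocks of at most block_size codes
    for word in tokenize(text):
        code = codes.setdefault(word, len(codes))
        if len(blocks[-1]) == block_size:
            blocks.append([])
        blocks[-1].append(code)
    return codes, blocks


def build_index(blocks):
    """Block-level inverted index: code -> ascending list of block numbers."""
    index = defaultdict(list)
    for block_no, block in enumerate(blocks):
        for code in set(block):
            index[code].append(block_no)
    return dict(index)


def search(keyword, codes, blocks, index):
    """Return (block_no, offset) hits by comparing codes, never decoding text."""
    code = codes.get(keyword.lower())
    if code is None:
        return []   # keyword is not in the dictionary, so it cannot occur
    hits = []
    for block_no in index.get(code, []):
        # Scan only the candidate blocks named by the index.
        for offset, c in enumerate(blocks[block_no]):
            if c == code:
                hits.append((block_no, offset))
    return hits


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog the end"
    codes, blocks = compress(sample, block_size=4)
    index = build_index(blocks)
    print(search("the", codes, blocks, index))
```

In this sketch the keyword is encoded once and then matched against stored codes, so the text is never decompressed. The `block_size` parameter stands in for the kind of trade-off the abstract mentions: larger blocks give fewer index entries per word but more scanning per query, while a block size of one corresponds to a fully positional index.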