We present a fast compression technique for natural language texts. The novelties are that (1) arbitrary portions of the text can be decompressed very efficiently, (2) exact search for words and phrases can be done directly on the compressed text, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code whose coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half that of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching for complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique on inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for display purposes.
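The core idea can be illustrated with a minimal sketch; it is not the authors' implementation. It makes two passes over the text (the semistatic part), treats words and separators as symbols, and builds a Huffman tree whose nodes have many children so that every codeword is a whole number of bytes. Several details are assumptions of this sketch rather than statements from the abstract: the tokenizer regex, the function names `tokenize`, `build_code`, `compress`, `decompress`, and `find_word`, and the choice to use only 7 payload bits per byte while reserving the high bit to mark the first byte of each codeword. That flag bit is what lets a plain byte-level search avoid false matches that straddle codeword boundaries.

```python
import heapq
import re
from collections import Counter
from itertools import count


def tokenize(text):
    """Split text into alternating words and separators (both are symbols)."""
    return re.findall(r"\w+|\W+", text)


def build_code(freqs, arity=128):
    """Build a d-ary Huffman code and return {symbol: codeword bytes}.

    Assumption of this sketch: each byte carries 7 payload bits and the
    high bit flags the first byte of a codeword.
    """
    tie = count()  # unique tie-breaker so the heap never compares nodes
    heap = [(f, next(tie), ("leaf", s)) for s, f in freqs.items()]
    # Pad with zero-weight dummies so every merge takes exactly `arity` nodes.
    while len(heap) < 2 or (len(heap) - 1) % (arity - 1) != 0:
        heap.append((0, next(tie), ("dummy", None)))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(arity)]
        weight = sum(g[0] for g in group)
        children = [g[2] for g in group]
        heapq.heappush(heap, (weight, next(tie), ("inner", children)))
    code = {}
    stack = [(heap[0][2], b"")]
    while stack:
        (kind, payload), prefix = stack.pop()
        if kind == "leaf":
            # Set the high bit on the first byte only.
            code[payload] = bytes([prefix[0] | 0x80]) + prefix[1:]
        elif kind == "inner":
            for i, child in enumerate(payload):
                stack.append((child, prefix + bytes([i])))
    return code


def compress(text):
    tokens = tokenize(text)
    code = build_code(Counter(tokens))        # pass 1: gather statistics
    data = b"".join(code[t] for t in tokens)  # pass 2: encode
    return data, code


def decompress(data, code):
    decode = {cw: sym for sym, cw in code.items()}
    out, start = [], 0
    for i in range(1, len(data) + 1):
        # A flagged byte (or the end of the data) closes the current codeword.
        if i == len(data) or data[i] & 0x80:
            out.append(decode[data[start:i]])
            start = i
    return "".join(out)


def find_word(data, code, word):
    """Return byte offsets of `word` by scanning the compressed bytes only."""
    cw = code.get(word)
    if cw is None:
        return []  # word is not in the vocabulary, so it cannot occur
    hits, pos = [], data.find(cw)
    while pos != -1:
        hits.append(pos)
        pos = data.find(cw, pos + 1)
    return hits
```

Searching works by compressing the pattern rather than decompressing the text: `find_word` looks up the word's codeword and hands it to an ordinary byte-string search (`bytes.find` here, though any sequential matcher would do). The flag bit costs one bit per byte of compression, which is the trade-off this sketch accepts in exchange for exact matching on the raw byte stream.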