Partial retrieval of compressed semi-structured documents

Authors:
Ashutosh Gupta;Suneeta Agarwal
Affiliations:
Department of Computer Science & Information Technology, Institute of Engineering and Technology, MJP Rohilkhand University, Bareilly, India.;Department of Computer Science & Engineering, MNNIT, Allahabad, India
Venue:
International Journal of Computer Applications in Technology
Year:
2010

Abstract

We describe a compression model for semi-structured documents called the tri-structural contexts model (TSCM). The idea is that modelling start tags, attribute name/value pairs and textual words separately may reduce the entropy. Attributes are combined with their values and stored in a separate container. We focus mainly on semi-static models and test the idea using a word-based tagged code, which allows random access and partial decompression of the compressed collection. Compression time is found to be better than that of scmhuff, and decompression time is observed to be much lower than that of scmhuff and xmlppm. The short partial-decompression time supports using the TSC model to keep semi-structured documents compressed at all times. The algorithm and proposed model are useful in information retrieval systems.
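The three-way separation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `split_into_containers` and `entropy_bits`, the regex-based tokenizer, and the choice to drop closing tags are all assumptions made for illustration. The sketch only routes start tags, combined attribute name/value pairs and text words into separate containers, and measures the zero-order empirical entropy of a symbol stream, so that the per-container figures can be compared with those of a single mixed stream.

```python
import math
import re
from collections import Counter

# Hypothetical, simplified tokenizer: not a full XML parser.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)([^<>]*)>")
ATTR_RE = re.compile(r'([A-Za-z_][\w.-]*)\s*=\s*"([^"]*)"')
WORD_RE = re.compile(r"\w+")


def split_into_containers(doc):
    """Route symbols into three streams: start tags, name=value pairs, words."""
    containers = {"tags": [], "attrs": [], "words": []}
    pos = 0
    for match in TAG_RE.finditer(doc):
        # Text between the previous tag and this one goes to the word stream.
        containers["words"].extend(WORD_RE.findall(doc[pos:match.start()]))
        closing, name, rest = match.groups()
        if not closing:
            # Assumption: only start tags are modelled; closing tags are dropped.
            containers["tags"].append(name)
            # Each attribute is kept together with its value as one symbol.
            containers["attrs"].extend(
                f"{key}={value}" for key, value in ATTR_RE.findall(rest)
            )
        pos = match.end()
    containers["words"].extend(WORD_RE.findall(doc[pos:]))
    return containers


def entropy_bits(symbols):
    """Zero-order empirical entropy of a symbol list, in bits per symbol."""
    if not symbols:
        return 0.0
    total = len(symbols)
    counts = Counter(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = (
        '<book id="1"><title lang="en">data compression</title>'
        '<title lang="en">text compression</title></book>'
    )
    streams = split_into_containers(sample)
    mixed = streams["tags"] + streams["attrs"] + streams["words"]
    for name, symbols in streams.items():
        print(name, len(symbols), round(entropy_bits(symbols), 3))
    print("mixed", len(mixed), round(entropy_bits(mixed), 3))
```

Running the script prints the size and entropy of each container next to the mixed stream. A semi-static coder would then build one word-based code per container in a first pass and encode in a second pass; that step is left out here because the abstract does not specify the tagged code.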