Semi-indexing semi-structured data in tiny space

Authors:
Giuseppe Ottaviano;Roberto Grossi
Affiliations:
Università di Pisa, Pisa, Italy;Università di Pisa, Pisa, Italy
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Citing 15
Cited 0

Efficient Storage and Retrieval by Content and Address of Static Files

Journal of the ACM (JACM)
Succinct representation of balanced parentheses, static trees and planar graphs

FOCS '97 Proceedings of the 38th Annual Symposium on Foundations of Computer Science
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

SIAM Journal on Computing
Querying and maintaining a compact XML storage

Proceedings of the 16th international conference on World Wide Web
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
XMark: a benchmark for XML data management

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets

ACM Transactions on Algorithms (TALG)
Efficiently Querying Large XML Data Repositories: A Survey

IEEE Transactions on Knowledge and Data Engineering
Engineering succinct DOM

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Space-efficient static trees and graphs

SFCS '89 Proceedings of the 30th Annual Symposium on Foundations of Computer Science
2LP: A double-lazy XML parser

Information Systems
XML compression techniques: A survey and comparison

Journal of Computer and System Sciences
Compressing and indexing labeled trees, with applications

Journal of the ACM (JACM)
Broadword implementation of rank/select queries

WEA'08 Proceedings of the 7th international conference on Experimental algorithms
Fully-functional succinct trees

SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms

Quantified Score

Hi-index	0.00

Visualization

Abstract

Semi-structured textual formats are gaining increasing popularity for the storage of document collections and rich logs. Their flexibility comes at the cost of having to load and parse a document entirely even if just a small part of it needs to be accessed. For instance, in data analytics massive collections are usually scanned sequentially, selecting a small number of attributes from each document. We propose a technique to attach to a raw, unparsed document (even in compressed form) a "semi-index": a succinct data structure that supports operations on the document tree at speed comparable with an in-memory deserialized object, thus bridging textual formats with binary formats. After describing the general technique, we focus on the JSON format: our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents using a small space overhead.