Storing semi-structured data on disk drives

Authors:
Medha Bhadkamkar;Fernando Farfan;Vagelis Hristidis;Raju Rangaswami
Affiliations:
Florida International University, Miami, FL;Florida International University, Miami, FL;Florida International University, Miami, FL;Florida International University, Miami, FL
Venue:
ACM Transactions on Storage (TOS)
Year:
2009


Abstract

Applications that manage semi-structured data are becoming increasingly commonplace. Current approaches for storing semi-structured data rely on existing storage machinery: they either map the data to relational databases or use a combination of flat files and indexes. While these existing mechanisms provide readily available solutions, their suitability for this class of data needs closer examination. In particular, retrofitting existing solutions for semi-structured data can result in a mismatch between the tree structure of the data and the access characteristics of the underlying storage device (the disk drive). This study explores the design space of native storage solutions for semi-structured data, examining alternative approaches that match application data access characteristics to those of the underlying disk drive. To evaluate the effectiveness of the proposed native techniques relative to the existing solutions, we experiment with XML data using the XPathMark benchmark. Extensive evaluation reveals the strengths and weaknesses of the proposed native data layout techniques. While the existing solutions work well for deep-focused queries into a semi-structured document (those that retrieve entire subtrees), the proposed native solutions substantially outperform them for non-deep-focused queries, which we demonstrate are at least as important as deep-focused ones. We believe that native data layout techniques offer a unique direction for improving the performance of semi-structured data stores for a variety of important workloads. However, given that the proposed native techniques require circumventing current storage stack abstractions, further investigation is warranted before they can be applied to general-purpose storage systems.
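
The mismatch between tree structure and sequential disk access can be illustrated with a toy sketch. It is not the layout technique proposed in the paper. The tree, the two placement orders (document order versus a hypothetical level-by-level order), and the use of "contiguous runs" as a rough proxy for disk seeks are all assumptions made purely for illustration.

```python
from collections import deque

# Toy document tree: node -> children (hypothetical, not taken from the paper).
TREE = {
    "root": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1", "b2"],
    "c": ["c1", "c2"],
    "a1": [], "a2": [], "b1": [], "b2": [], "c1": [], "c2": [],
}


def document_order_layout(tree, root):
    """Place nodes in pre-order, the order a flat file would store them in."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(tree[node]))
    return {node: pos for pos, node in enumerate(order)}


def level_order_layout(tree, root):
    """Place nodes level by level, so siblings sit next to each other."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return {node: pos for pos, node in enumerate(order)}


def subtree(tree, node):
    """All nodes touched by a deep-focused query rooted at `node`."""
    nodes = [node]
    for child in tree[node]:
        nodes.extend(subtree(tree, child))
    return nodes


def contiguous_runs(layout, nodes):
    """Number of separate sequential reads needed to fetch `nodes`."""
    if not nodes:
        return 0
    positions = sorted(layout[n] for n in nodes)
    return 1 + sum(1 for p, q in zip(positions, positions[1:]) if q != p + 1)


doc = document_order_layout(TREE, "root")
lvl = level_order_layout(TREE, "root")

deep = subtree(TREE, "b")   # deep-focused: the entire subtree under "b"
shallow = TREE["root"]      # non-deep-focused: only the children of the root

print("deep-focused    doc-order:", contiguous_runs(doc, deep),
      " level-order:", contiguous_runs(lvl, deep))
print("non-deep-focused doc-order:", contiguous_runs(doc, shallow),
      " level-order:", contiguous_runs(lvl, shallow))
```

In this toy setting, document order serves the subtree query with a single sequential read but scatters the root's children, while the level-by-level order does the reverse. Real disk behavior depends on many factors the sketch ignores, such as block size, caching, and scheduling.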