NoDB: efficient query execution on raw data files

Authors:
Ioannis Alagiannis;Renata Borovica;Miguel Branco;Stratos Idreos;Anastasia Ailamaki
Affiliations:
EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland;EPFL, Lausanne, Switzerland;CWI, Amsterdam, The Netherlands;EPFL, Lausanne, Switzerland
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Abstract

As data collections become larger and larger, data loading becomes a major bottleneck. Many applications, e.g., scientific data analysis and social networks, already avoid using database systems due to their complexity and the increased data-to-query time. For such applications data collections keep growing fast, even on a daily basis, and we are already in the era of data deluge, where we have much more data than we can move or store, let alone analyze. Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned by implementing the NoDB philosophy over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure. Our implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of plain PostgreSQL and even outperforming it in many cases. We conclude that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect in usability and performance.
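
The idea of keeping positional information and a cache over a raw file can be illustrated with a minimal Python sketch. This is not the PostgresRaw implementation. The class `RawCsvTable`, its methods, and the `fields_skipped` counter are hypothetical names chosen for illustration. The sketch assumes well-formed CSV text held in memory, no quoted fields, and a single type per column.

```python
class RawCsvTable:
    """Toy in situ reader over raw CSV text (illustrative only, not PostgresRaw)."""

    def __init__(self, raw, delimiter=","):
        self.raw = raw
        self.delimiter = delimiter
        # One pass to find where each non-empty row starts.
        row_starts = []
        offset = 0
        for line in raw.splitlines(keepends=True):
            if line.strip():
                row_starts.append(offset)
            offset += len(line)
        # Positional map: column index -> start offset of that field in every row.
        # Column 0 is known for free because it starts where the row starts.
        self.positions = {0: row_starts}
        # Cache: column index -> already converted values (assumes one type per column).
        self.cache = {}
        # Counts how many fields had to be tokenized just to reach the target field.
        self.fields_skipped = 0

    def _end_of_field(self, start):
        # Scan forward to the next delimiter or line break.
        end = start
        stops = (self.delimiter, "\n", "\r")
        while end < len(self.raw) and self.raw[end] not in stops:
            end += 1
        return end

    def column(self, col, convert=int):
        # Cached columns need no parsing and no type conversion.
        if col in self.cache:
            return self.cache[col]
        # Start from the nearest column to the left that the map already knows.
        known = max(c for c in self.positions if c <= col)
        starts, values = [], []
        for pos in self.positions[known]:
            for _ in range(col - known):
                pos = self._end_of_field(pos) + 1  # step past the delimiter
                self.fields_skipped += 1
            starts.append(pos)
            values.append(convert(self.raw[pos:self._end_of_field(pos)]))
        # Remember the positions and the converted values for later queries.
        self.positions[col] = starts
        self.cache[col] = values
        return values


table = RawCsvTable("1,10,100\n2,20,200\n3,30,300\n")
second = table.column(1)   # tokenizes one field per row to reach column 1
third = table.column(2)    # resumes from column 1 instead of the row start
```

In this sketch the second query skips only one field per row because it resumes from the positions recorded by the first query, and a repeated request for the same column is answered from the cache without touching the raw text.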