Invisible loading: access-driven data transfer from raw files into database systems

Authors:
Azza Abouzied; Daniel J. Abadi; Avi Silberschatz
Affiliations:
Yale University; Yale University; Yale University
Venue:
Proceedings of the 16th International Conference on Extending Database Technology
Year:
2013

Abstract

Commercial analytical database systems suffer from a high "time-to-first-analysis": before data can be processed, it must be modeled and schematized (a human effort), transferred into the database's storage layer, and optionally clustered and indexed (a computational effort). For many types of structured data, this upfront effort is unjustifiable, so the data is processed directly over the file system using the Hadoop framework, despite the cumulative performance benefits of processing it in an analytical database system. In this paper we describe a system that achieves the immediate gratification of running MapReduce jobs directly over a file system, while still making progress towards the long-term performance benefits of database systems. The basic idea is to piggyback on MapReduce jobs, leveraging their parsing and tuple extraction operations to incrementally load and organize tuples into a database system while simultaneously processing the file system data. We call this scheme Invisible Loading: it loads a fraction of the data at a time at almost no marginal cost in query latency, yet allows future queries to run much faster.
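
The piggybacking idea in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' system: the class `InvisibleLoader`, its `fraction` parameter, the two-column `name,value` record format, and the in-memory list standing in for the database are all hypothetical, and no Hadoop or DBMS is involved. Each query parses whatever part of the raw file has not yet been loaded and, as a side effect, moves one more fraction of the parsed tuples into the "database", so later queries parse less.

```python
class InvisibleLoader:
    """Toy model: every query loads one more fraction of a raw file."""

    def __init__(self, raw_text, fraction=0.25):
        # Raw "file": one comma-separated record per line (assumed format).
        self.raw_lines = raw_text.strip().splitlines()
        # Stand-in for the database: tuples that have already been loaded.
        self.loaded = []
        # Number of tuples to load per query (at least one).
        self.chunk = max(1, int(len(self.raw_lines) * fraction))
        # Bookkeeping: how many raw lines the last query had to parse.
        self.parsed_last_query = 0

    def _parse(self, line):
        # Parsing and tuple extraction, the work the query does anyway.
        name, value = line.split(",")
        return (name, int(value))

    def query(self, predicate):
        """Return matching tuples and load the next chunk as a side effect."""
        # The already-loaded prefix is read without any parsing.
        results = [t for t in self.loaded if predicate(t)]
        start = len(self.loaded)
        self.parsed_last_query = 0
        # The remainder is still processed directly from the raw file.
        for i in range(start, len(self.raw_lines)):
            t = self._parse(self.raw_lines[i])
            self.parsed_last_query += 1
            if i < start + self.chunk:
                # Piggyback: keep the tuple that was just parsed.
                self.loaded.append(t)
            if predicate(t):
                results.append(t)
        return results


raw = "a,1\nb,2\nc,3\nd,4"
loader = InvisibleLoader(raw, fraction=0.5)

first = loader.query(lambda t: t[1] > 1)
parsed_first = loader.parsed_last_query    # all four lines parsed

second = loader.query(lambda t: t[1] > 1)
parsed_second = loader.parsed_last_query   # only the unloaded half parsed

third = loader.query(lambda t: t[1] > 1)
parsed_third = loader.parsed_last_query    # everything is loaded by now
```

In this toy run every query returns the same answer while the amount of raw parsing shrinks from query to query, which mirrors the trade-off the abstract describes: each query pays only for the small chunk it loads, and later queries benefit from the tuples already in the database.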