Commercial analytical database systems suffer from a high "time-to-first-analysis": before data can be processed, it must be modeled and schematized (a human effort), transferred into the database's storage layer, and optionally clustered and indexed (a computational effort). For many types of structured data, this upfront effort is unjustifiable, so the data is processed directly over the file system using the Hadoop framework, even though processing it in an analytical database system would yield cumulative performance benefits. In this paper we describe a system that achieves the immediate gratification of running MapReduce jobs directly over a file system, while still making progress towards the long-term performance benefits of database systems. The basic idea is to piggyback on MapReduce jobs, leveraging their parsing and tuple extraction operations to incrementally load and organize tuples into a database system while simultaneously processing the file system data. We call this scheme Invisible Loading, because it loads a fraction of the data at a time at almost no marginal cost in query latency, yet allows future queries to run much faster.
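The piggybacking idea described above can be illustrated with a small, single-process sketch. It is not the paper's implementation: there is no Hadoop or MapReduce runtime here, an in-memory SQLite table stands in for the database system, and the names `InvisibleLoader`, `parse`, `fraction`, and `total_by_user` are hypothetical. The sketch only shows the control flow: each query parses the not-yet-loaded raw records as it must anyway, reuses those parsed tuples to load one more fraction of the file, and answers the rest from the database.

```python
import sqlite3


def parse(line):
    """Parse one raw record of the assumed form 'user,amount'."""
    user, amount = line.split(",")
    return user, int(amount)


class InvisibleLoader:
    """Toy model of incremental loading piggybacked on query processing."""

    def __init__(self, raw_lines, fraction=0.25):
        self.raw = raw_lines
        # SQLite is an assumption standing in for the analytical database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE t (user TEXT, amount INTEGER)")
        self.loaded = 0  # count of leading raw records already in the database
        self.chunk = max(1, int(len(raw_lines) * fraction))

    def total_by_user(self):
        """Sum amounts per user over the whole data set."""
        totals = {}
        # Part 1: records loaded by earlier queries are read from the database.
        query = "SELECT user, SUM(amount) FROM t GROUP BY user"
        for user, subtotal in self.db.execute(query):
            totals[user] = subtotal
        # Part 2: remaining records are parsed from the raw file; the parsed
        # tuples of the next chunk are also kept for loading.
        load_until = min(len(self.raw), self.loaded + self.chunk)
        to_load = []
        for i in range(self.loaded, len(self.raw)):
            user, amount = parse(self.raw[i])
            totals[user] = totals.get(user, 0) + amount
            if i < load_until:
                to_load.append((user, amount))
        self.db.executemany("INSERT INTO t VALUES (?, ?)", to_load)
        self.db.commit()
        self.loaded = load_until
        return totals
```

With `fraction=0.5`, for example, the first call parses the whole file and loads half of it, the second call parses only the remaining half and loads it, and later calls are answered entirely from the database. Every call returns the same result, which is the property the scheme relies on.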